
A first look at alerting in SAS Viya

Started ‎06-10-2021 by
Modified ‎06-11-2021 by

Alerting is back in the new SAS Viya (2020.1 and later), but not as we know it. Previous articles have introduced the new logging functions in SAS Viya, and there are also new monitoring components that make use of established, industry-standard, dedicated third-party applications, all included in the optional SAS Viya Monitoring for Kubernetes project. This add-on framework uses some of these tools to provide a means to access metrics, create alerts, view logs, and perform other Kubernetes cluster monitoring functions at your SAS site. At the core of the monitoring framework is Prometheus, which is responsible for capturing and storing system metrics. In this article, we'll look at the alerting framework built on top of these metrics and demonstrate how Prometheus AlertManager can be used to create alert rules in SAS Viya that trigger notifications when metrics meet specified conditions.

Prometheus AlertManager

First, a quick look under the covers.

[Image: af_1_ops4viya_alertmanager-2.png]


 

 

Prometheus collects metrics from SAS Viya by querying each container's metrics endpoint. Grafana queries the collected data to provide a visual monitoring interface. AlertManager is the Prometheus component responsible for handling alerts: it listens for firing alerts, groups and deduplicates them, sends notifications to designated channels, and more. We'll discuss these and other functions and look at how they work later in this article.

Creating Alerts using PromQL

Prometheus metrics can be queried from the Prometheus UI with PromQL, Prometheus' native query language. PromQL expressions form the basis of a PrometheusRule (a custom resource provided by the Prometheus Operator). That is, any condition we want to alert on must be expressed as a PromQL query in a PrometheusRule definition. For example, assume we want to fire an alert when pods in our SAS Viya namespace consume more than 80% of a node's total memory capacity. The first step is to define the rule as a block in a new YAML file.

 

tee PrometheusRule.yaml > /dev/null << EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: prometheus-operator-prometheus
    role: alert-rules
  name: prometheus-viya4-rules
  namespace: ops4viyamon
spec:
  groups:
  - name: viya4-nodes
    rules:
    - alert: Viya4NSMemoryUsage
      annotations:
        description: Total Viya 4 namespace memory usage is more than 80% of total memory capacity.
        summary: Alerting on Viya 4 namespace Memory usage
      expr: ((sum by (node) (container_memory_usage_bytes{container!~"POD",namespace="gelcorp",pod=~"sas-.+"})) / (sum by (node) (kube_node_status_capacity{resource="memory"})) * 100) > 80
      labels:
        severity: critical
EOF

Note the expression that performs the calculation: total memory used by all SAS pods (matched with a regular expression) in my namespace ("gelcorp") as a percentage of the total memory capacity of the cluster nodes. Far more complex expressions can be created with PromQL: there are hundreds of metrics that can be queried, and PromQL has advanced functions for calculating aggregations like averages over time, or for predicting future metric values based on historic values. The rule is created with a kubectl command:

kubectl create --filename ./PrometheusRule.yaml
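For illustration, two such functions might be used as follows. These are sketches, not queries from the original deployment: the memory selector is borrowed from the rule above, while the filesystem selector and the window lengths are assumptions.

```
# Per-series average memory usage of the gelcorp SAS pods over the last 10 minutes
avg_over_time(container_memory_usage_bytes{namespace="gelcorp",pod=~"sas-.+"}[10m])

# Extrapolate the last hour's trend 4 hours ahead and fire if free space would go negative
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
```

Either expression could serve as the expr of a PrometheusRule in the same way as the memory rule above.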

Alert notifications

In SAS 9, escalation schemas were used to define what happens when an alert starts firing. In SAS Viya, who gets notified and how is specified in the AlertManager configuration. Alert notifications from AlertManager can be distributed in several ways, including via email, via third-party systems such as Slack or MS Teams, or via webhooks that raise alerts in other applications. First, a YAML file containing the necessary configuration must be created. We need to configure a receiver and set up the routing. Routing can be used to send different alerts to different channels by attaching labels to alerts and matching them to receivers (the designated channels). For instance, it may be necessary to send warnings to only a subset of admins, whereas critical alerts might go to all users. In the simple example below, all alert notifications will simply be sent to my designated receiver, but a routing tree could also be defined here instead.

tee alertmanager.yaml > /dev/null << EOF
global:
  smtp_smarthost: pdcesx02190.race.sas.com:1025
  smtp_from: 'alertmanager@gelcorp.com'
  smtp_require_tls: false

route:
  receiver: sas-admins-email-alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

receivers:
- name: sas-admins-email-alert
  email_configs:
  - to: sasadmin@gelcorp.com
    send_resolved: true
    headers:
      Subject: 'COMPUTER SAYS NO'
EOF

 

Grouping is a mechanism for combining alerts of a similar nature into a single notification. Labels can be used to assign each alert to a 'group', and that group can then be referenced in the config file with a group_by parameter in the route block. We'll look more closely at routing and label-matching to groups in a future article.

The AlertManager configuration is stored as a Kubernetes secret, so in order for the configuration to be updated with the contents of the YAML file above, we must update that secret. First, encode the YAML with base64:

cat ./alertmanager.yaml | base64 -w0

Take the resulting encoded string and paste it into a new YAML file, which will be used to update the 'alertmanager-v4m-alertmanager' secret.
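To sketch what grouping and a routing tree might look like together, the route block above could be extended as below. This is an illustration only: the oncall receiver name is hypothetical, and the group_by labels are examples.

```yaml
route:
  receiver: sas-admins-email-alert       # default receiver for anything not matched below
  group_by: ['alertname', 'namespace']   # batch alerts sharing these labels into one notification
  routes:
  - match:
      severity: critical                 # critical alerts go to all admins
    receiver: sas-admins-email-alert
  - match:
      severity: warning                  # warnings go to a smaller group (hypothetical receiver)
    receiver: oncall-admins-email-alert
```

Each child route inherits unset parameters (group_wait, repeat_interval, and so on) from its parent.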

 

tee alertmanager-secret.yaml > /dev/null << EOF
apiVersion: v1
data:
  alertmanager.yaml: Z2xvYmFsOg0KICAjIHJlc29sdmVfdGltZW91dDogNW0NCiAgc210cF9zbWFydGhvc3Q6IHBkY2VzeDAyMTkwLnJhY2Uuc2FzLmNvbToxMDI1DQogIHNtdHBfZnJvbTogJ2FsZXJ0bWFuYWdlckBleGFtcGxlLm9yZycNCiAgc210cF9yZXF1aXJlX3RsczogZmFsc2UNCg0Kcm91dGU6DQogICMgZ3JvdXBfYnk6IFtBbGVydG5hbWVdDQogICMgU2VuZCBhbGwgbm90aWZpY2F0aW9ucyB0byBtZS4NCiAgcmVjZWl2ZXI6IHNhcy1hZG1pbnMtZW1haWwtYWxlcnQNCiAgZ3JvdXBfd2FpdDogMzBzDQogIGdyb3VwX2ludGVydmFsOiA1bQ0KICByZXBlYXRfaW50ZXJ2YWw6IDEyaA0KICAjIHJvdXRlczoNCiAgIyAtIG1hdGNoOg0KICAjICAgIGFsZXJ0bmFtZTogQ29udGFpbmVyQ3B1VXNhZ2UNCiAgIyAgcmVjZWl2ZXI6ICdzYXMtYWRtaW5zLWVtYWlsLWFsZXJ0Jw0KDQpyZWNlaXZlcnM6DQotIG5hbWU6IHNhcy1hZG1pbnMtZW1haWwtYWxlcnQNCiAgZW1haWxfY29uZmlnczoNCiAgLSB0bzogd2hhdEB3aGF0bm93LmNvbQ0KICAgICMgZnJvbTogbm9yZXBseV92aXlhQHNhcy5jb20NCiAgICAjIFlvdXIgc210cCBzZXJ2ZXIgYWRkcmVzcw0KICAgICMgc21hcnRob3N0OiBwZGNlc3gwMjE5MC5yYWNlLnNhcy5jb206MTAyNQ0KICAgICMgYXV0aF91c2VybmFtZTogd2hhdEB3aGF0bm93LmNvbQ0KICAgICMgYXV0aF9pZGVudGl0eTogd2hhdEB3aGF0bm93LmNvbQ0KICAgIGhlYWRlcnM6DQogICAgIyAgRnJvbTogbm9yZXBseV92aXlhQHNhcw0KICAgICAgU3ViamVjdDogJ0RlbW8gQUxFUlQnDQogICAgICBzZW5kX3Jlc29sdmVkOiB0cnVlDQo=
kind: Secret
metadata:
  name: alertmanager-v4m-alertmanager
  namespace: ops4viyamon
type: Opaque
EOF

 

Then run the command below to update the secret:

kubectl apply -f ./alertmanager-secret.yaml

A quick way to check the configuration has been updated is by logging on to the AlertManager UI and heading to the Status tab.

 

[Image: af_2_alertmanager_status.png]

 

Now, when alert conditions are met, the state of the alert changes to firing...

 

[Image: af_3_alertmanager_firing_alerts.png]

 

... and the alert notification is sent to the receiver.

[Image: af_4_alertnotification_mailhog.png]

 

Silencing and other actions

So what does an admin do when they receive a notification? Obviously, some action needs to be performed to stop the alert from firing, but alerts can also be silenced before any remediation work begins. Silencing is simply a means of muting an alert for a given period (for example, while the issue causing the alert to fire is being addressed). Alerts can be silenced from the AlertManager UI, where firing alerts appear in red. Another related function is inhibition: the suppression of alerts while other specified alerts are firing. Using the example from the Prometheus documentation, there would be little sense in firing alerts about unreachable pods if the entire cluster itself is unreachable. For this kind of scenario, an inhibition rule can be defined in the AlertManager configuration to specify which alerts should generate notifications and which should be suppressed. AlertManager also automatically performs alert deduplication. That is, it fires one alert for one issue, rather than firing a new alert each time it re-checks the alert condition. A firing alert appears as a single line item (in red) in the UI and remains there until the issue is remediated or the alert is silenced.
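As a sketch of that cluster-down scenario, an inhibition rule in alertmanager.yaml might look like the following. The alert names here are hypothetical, and the match-based syntax shown is the classic AlertManager form.

```yaml
inhibit_rules:
- source_match:
    alertname: ClusterUnreachable   # while this alert is firing...
  target_match:
    alertname: PodUnreachable       # ...suppress notifications for this one
  equal: ['cluster']                # but only when both alerts carry the same cluster label
```

Without the equal clause, a ClusterUnreachable alert in one cluster would mute pod alerts everywhere, so scoping labels are worth including.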

More Information

In Greek mythology, Prometheus' name means 'forethought'. With Prometheus AlertManager, administrators can proactively monitor their SAS Viya environment, anticipate issues, and receive prompts to fix things before they become bigger problems. For more information, refer to the official Prometheus documentation. Also be sure to download and deploy the SAS Viya Monitoring for Kubernetes project at your site to get started with monitoring, logging, and alerting.

 

Thank you for reading. I hope the information provided in this article has been helpful. Please leave a comment below to ask questions or share your own experiences.

 

Find more articles from SAS Global Enablement and Learning here.

Comments

Hi Ajmal,

 

Looks like the link here:  "SAS Viya Monitoring for Kubernetes project" is to a SAS gitlab repo rather than a public github one?  Really useful examples too 👍

 

Alan

Good catch, Alan. I've now fixed the link - sorry about that. Thank you!

Thanks Ajmal, are there any plans to include anything on the MAS Prometheus feed in the Viya 4 GitHub (example dashboards, for instance)? Although the online docs suggest it's not available in the last release:

 

release 2021.1.1 shows the 'management' section, but release 2021.1 doesn't.

 

Alan

Apologies for the delay, Alan. I'm trying to track down some answers for you. I'll provide an update when I have more information.
Thanks!


Hi @alancox,

In 2021.1 and later, MAS metrics will automatically be exported into Prometheus format and integrated with the SAS Viya Monitoring for Kubernetes solution. Those "management" settings (or any other additional configuration settings) are therefore not required.  

Unfortunately, there are no pre-built Grafana dashboards specifically for MAS at this stage, but depending on what information you'd like to see, you might find the other dashboards that are included useful.

Thanks! 

Thanks for the update @AjmalFarzam 

 

I can see details (& blogs of course) on the new ways logs & metrics are available in Viya4. My question is really on the continuation of the MAS metrics currently available in Viya3.5, rather than the generic JVM type ones that are generated. I've already been using a tweaked version of this for instance to monitor multiple Viya3.5 microservices.

 

These MAS metrics are in addition to those that come out of the other microservices:

# HELP mas_module_execution_seconds
# TYPE mas_module_execution_seconds summary
mas_module_execution_seconds_count{module="echo",tenant="provider",} 6.0
mas_module_execution_seconds_sum{module="echo",tenant="provider",} 0.066334917

# HELP mas_module_execution_seconds_max
# TYPE mas_module_execution_seconds_max gauge
mas_module_execution_seconds_max{module="echo",tenant="provider",} 0.0

# HELP mas_module_score_max describes how scores are distributed
# TYPE mas_module_score_max gauge
mas_module_score_max{module="echo",tenant="provider",} 0.0

# HELP mas_module_score describes how scores are distributed
# TYPE mas_module_score summary
mas_module_score_count{module="echo",tenant="provider",} 2.0
mas_module_score_sum{module="echo",tenant="provider",} 300.0

# HELP mas_module_availability
# TYPE mas_module_availability gauge
mas_module_availability{module="lmedt22s6fzaztjs5unrc2ryfzq",tenant="provider",} 1.0
mas_module_availability{module="hmeq_value",tenant="provider",} 1.0
mas_module_availability{module="dcm_treatments_c2eca78a_2624_4744_9ffd_e238eec02ec2",tenant="provider",} 1.0
mas_module_availability{module="logistic_regression_06e7bffb_89f3_4775_8bd8_029793970f91",tenant="provider",} 1.0
mas_module_availability{module="sql_executer",tenant="provider",} 1.0
mas_module_availability{module="echo",tenant="provider",} 1.0

# HELP mas_modules_availability
# TYPE mas_modules_availability gauge
mas_modules_availability{tenant="provider",} 1.0

# HELP mas_core_memoryused
# TYPE mas_core_memoryused gauge
mas_core_memoryused 8.33036288E8

These are invaluable in creating dashboards to show how MAS is performing real-time - are these continued into Viya4? They have gone from the online docs as far as I can tell.

 

Regards

 

Alan

 

