Alerting is back in the new SAS Viya (2020.1 and later), but not as we know it. Previous articles have introduced the new logging functions in SAS Viya, and there are also new monitoring components that make use of established, industry-standard third-party applications, all included in the optional SAS Viya Monitoring for Kubernetes project. This add-on framework uses these tools to provide a means to access metrics, create alerts, view logs, and perform other Kubernetes cluster monitoring functions at your SAS site. At the core of the monitoring framework is Prometheus, which is responsible for capturing and storing system metrics. In this article, we'll look at the alerting framework built on top of those metrics and demonstrate how Prometheus AlertManager can be used to create alert rules for SAS Viya that trigger notifications when metrics meet specified conditions.
First, a quick look under the covers.
Prometheus collects metrics from SAS Viya by scraping each container's metrics endpoint. Grafana queries the collected data to provide a visual monitoring interface. AlertManager is the Prometheus component responsible for handling alerts: it listens for firing alerts, groups and deduplicates them, sends notifications to designated channels, and more. We'll discuss these and other functions and look at how they work later in this article.
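If you want to explore the metrics and alerts interactively, the Prometheus and AlertManager web UIs can be reached with a simple port-forward. This is only a sketch: the service names below are assumptions based on a typical deployment of the monitoring project, so check the actual names in your cluster first.

# List the services created by the monitoring deployment (names may differ at your site)
kubectl -n ops4viyamon get svc
# Forward the (assumed) Prometheus and AlertManager services to your workstation
kubectl -n ops4viyamon port-forward svc/v4m-prometheus 9090:9090 &
kubectl -n ops4viyamon port-forward svc/v4m-alertmanager 9093:9093 &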
Prometheus metrics can be queried in the Prometheus UI using PromQL, Prometheus' native query language. PromQL expressions also form the basis of a PrometheusRule (a Custom Resource Definition): any condition we want to alert on must be expressed as a PromQL query in a PrometheusRule definition. For example, assume we want to fire an alert when the SAS Viya pods in our namespace consume more than 80% of the memory capacity of a node in the cluster. The first step is to define the rule as a block in a new YAML file.
tee PrometheusRule.yaml > /dev/null << EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: prometheus-operator-prometheus
    role: alert-rules
  name: prometheus-viya4-rules
  namespace: ops4viyamon
spec:
  groups:
  - name: viya4-nodes
    rules:
    - alert: Viya4NSMemoryUsage
      annotations:
        description: Total Viya 4 namespace memory usage is more than 80% of total memory capacity.
        summary: Alerting on Viya 4 namespace Memory usage
      expr: ((sum by (node) (container_memory_usage_bytes{container!~"POD",namespace="gelcorp",pod=~"sas-.+"})) / (sum by (node) (kube_node_status_capacity{resource="memory"})) * 100) > 80
      labels:
        severity: critical
EOF
Note the expression that performs the calculation: the total memory used by all SAS pods (RegEx-matched) in my namespace ("gelcorp"), expressed as a percentage of the memory capacity of the cluster nodes. Far more complex expressions can be built with PromQL. Hundreds of metrics are available to query, and PromQL provides advanced functions for calculating aggregations such as averages over time, or for predicting future metric values based on historic values. The rule is created with a kubectl command: kubectl create --filename ./PrometheusRule.yaml
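To give a feel for those functions, here are two illustrative queries you could try in the Prometheus UI. These are sketches only: the metric selectors assume the standard cAdvisor and node-exporter metrics are being collected in your cluster.

Average memory usage of each SAS container over the last hour:
avg_over_time(container_memory_usage_bytes{namespace="gelcorp",pod=~"sas-.+"}[1h])

Predict, from the last six hours of data, whether a node filesystem will fill up within the next four hours (the comparison is true when the predicted free space drops below zero):
predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0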
In SAS 9, escalation schemas were used to define what happens when an alert starts firing. In SAS Viya, who gets notified, and how, is defined in the AlertManager configuration. AlertManager can distribute alert notifications in several ways, including via email, via third-party systems such as Slack or MS Teams, or via webhooks that trigger alerts in other applications. First, a YAML file containing the necessary configuration must be created. We need to configure a receiver and set up the routing. Routing can be used to send different alerts to different channels by labelling alerts and matching those labels to receivers (the designated channels). For instance, warnings might be sent only to a subset of admins, whereas critical alerts go to all users. In the simple example below, all alert notifications are simply sent to my designated receiver, but a routing tree could also be defined here instead (a sketch of one follows the example).
tee alertmanager.yaml > /dev/null << EOF
global:
  smtp_smarthost: pdcesx02190.race.sas.com:1025
  smtp_from: 'alertmanager@gelcorp.com'
  smtp_require_tls: false
route:
  receiver: sas-admins-email-alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
- name: sas-admins-email-alert
  email_configs:
  - to: sasadmin@gelcorp.com
    headers:
      Subject: 'COMPUTER SAYS NO'
    send_resolved: true
EOF
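As mentioned above, a routing tree can be used to direct different alerts to different receivers. Here is a minimal sketch of what that might look like in the same file, grouping notifications by alert name and sending critical alerts to a second receiver; the receiver name all-users-email-alert and the label values are purely illustrative and would need to be defined to suit your site.

route:
  receiver: sas-admins-email-alert     # default receiver for anything not matched below
  group_by: ['alertname']              # batch alerts with the same name into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
  - match:
      severity: critical               # alerts labelled severity: critical...
    receiver: all-users-email-alert    # ...go to this (hypothetical) wider receiver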
Grouping is a mechanism for combining alerts of a similar nature into a single notification. Labels can be used to assign each alert to a 'group', which is then referenced in the config file with a group_by parameter in the route block, as in the sketch above. We'll look more closely at routing and label-matching to groups in a future article. The AlertManager configuration is stored as a Kubernetes secret, so for the configuration to be updated with the contents of the YAML file above, we must update that secret. First, encode the YAML with base64: cat ./alertmanager.yaml | base64 -w0
Take the resulting encoded string and paste it into a new YAML file, which will be used to update the 'alertmanager-v4m-alertmanager' secret.
tee alertmanager-secret.yaml > /dev/null << EOF
apiVersion: v1
data:
  alertmanager.yaml: Z2xvYmFsOg0KICAjIHJlc29sdmVfdGltZW91dDogNW0NCiAgc210cF9zbWFydGhvc3Q6IHBkY2VzeDAyMTkwLnJhY2Uuc2FzLmNvbToxMDI1DQogIHNtdHBfZnJvbTogJ2FsZXJ0bWFuYWdlckBleGFtcGxlLm9yZycNCiAgc210cF9yZXF1aXJlX3RsczogZmFsc2UNCg0Kcm91dGU6DQogICMgZ3JvdXBfYnk6IFtBbGVydG5hbWVdDQogICMgU2VuZCBhbGwgbm90aWZpY2F0aW9ucyB0byBtZS4NCiAgcmVjZWl2ZXI6IHNhcy1hZG1pbnMtZW1haWwtYWxlcnQNCiAgZ3JvdXBfd2FpdDogMzBzDQogIGdyb3VwX2ludGVydmFsOiA1bQ0KICByZXBlYXRfaW50ZXJ2YWw6IDEyaA0KICAjIHJvdXRlczoNCiAgIyAtIG1hdGNoOg0KICAjICAgIGFsZXJ0bmFtZTogQ29udGFpbmVyQ3B1VXNhZ2UNCiAgIyAgcmVjZWl2ZXI6ICdzYXMtYWRtaW5zLWVtYWlsLWFsZXJ0Jw0KDQpyZWNlaXZlcnM6DQotIG5hbWU6IHNhcy1hZG1pbnMtZW1haWwtYWxlcnQNCiAgZW1haWxfY29uZmlnczoNCiAgLSB0bzogd2hhdEB3aGF0bm93LmNvbQ0KICAgICMgZnJvbTogbm9yZXBseV92aXlhQHNhcy5jb20NCiAgICAjIFlvdXIgc210cCBzZXJ2ZXIgYWRkcmVzcw0KICAgICMgc21hcnRob3N0OiBwZGNlc3gwMjE5MC5yYWNlLnNhcy5jb206MTAyNQ0KICAgICMgYXV0aF91c2VybmFtZTogd2hhdEB3aGF0bm93LmNvbQ0KICAgICMgYXV0aF9pZGVudGl0eTogd2hhdEB3aGF0bm93LmNvbQ0KICAgIGhlYWRlcnM6DQogICAgIyAgRnJvbTogbm9yZXBseV92aXlhQHNhcw0KICAgICAgU3ViamVjdDogJ0RlbW8gQUxFUlQnDQogICAgICBzZW5kX3Jlc29sdmVkOiB0cnVlDQo=
kind: Secret
metadata:
  name: alertmanager-v4m-alertmanager
  namespace: ops4viyamon
type: Opaque
EOF
Then run the command below to update the secret. kubectl apply -f ./alertmanager-secret.yaml
A quick way to check that the configuration has been updated is to log on to the AlertManager UI and head to the Status tab.
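If you prefer the command line, you can also decode the secret to see exactly what AlertManager has been given (the secret and namespace names match this example deployment; adjust them for your site):

kubectl -n ops4viyamon get secret alertmanager-v4m-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d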
Now, when alert conditions are met, the state of the alert changes to firing...
... and the alert notification is sent to the receiver.
So what does an admin do when they receive notifications? Obviously, some action needs to be performed to stop the alert from firing, but alerts can also be silenced before any remediation work begins. Silencing is simply a means of muting an alert for a given period (for example, while the issue causing the alert to fire is being addressed). Alerts can be silenced from the AlertManager UI, where firing alerts appear in red. Another related function is inhibition. Inhibition refers to the suppression of alerts when other specified alerts are firing. Using the example from the Prometheus documentation, there would be little sense in firing alerts about unreachable pods if the entire cluster itself is unreachable. For this kind of scenario, an inhibition rule can be defined in the AlertManager configuration to specify which alerts should send notifications and which should be suppressed. AlertManager also automatically performs alert de-duplication; that is, it raises a single alert for a given issue rather than a new one each time it re-checks the alert condition. A firing alert appears as one line item (in red) in the UI and remains there until the issue is remediated or the alert is silenced.
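As a rough sketch of what that could look like (the label names and values here are illustrative, not taken from the deployed configuration), an inhibition rule is added to the AlertManager configuration alongside the route and receivers blocks:

inhibit_rules:
- source_match:
    severity: critical      # while a critical alert is firing...
  target_match:
    severity: warning       # ...suppress notifications for warning-level alerts...
  equal: ['namespace']      # ...that carry the same namespace label as the critical alert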
The mythological Prometheus was the god of forethought. With Prometheus AlertManager, administrators can proactively monitor their SAS Viya environment, anticipate issues, and receive prompts to fix things before they become bigger problems. For more information, refer to the official Prometheus documentation. Also be sure to download and deploy the SAS Viya Monitoring for Kubernetes project at your site to get started with monitoring, logging, and alerting.
Thank you for reading. I hope the information provided in this article has been helpful. Please leave a comment below to ask questions or share your own experiences.
Find more articles from SAS Global Enablement and Learning here.
Hi Ajmal,
Looks like the link here: "SAS Viya Monitoring for Kubernetes project" is to a SAS GitLab repo rather than a public GitHub one? Really useful examples too 👍
Alan
Thanks Ajmal, are there any plans to include anything on the MAS Prometheus feed in the Viya 4 GitHub (example dashboards, for instance)? Although the online docs suggest it's not available in the last release:
release 2021.1.1 shows the 'management' section, but release 2021.1 doesn't.
Alan
Hi @alancox,
In 2021.1 and later, MAS metrics will automatically be exported into Prometheus format and integrated with the SAS Viya Monitoring for Kubernetes solution. Those "management" settings (or any other additional configuration settings) are therefore not required.
Unfortunately, there are no pre-built Grafana dashboards specifically for MAS at this stage, but depending on what information you'd like to see, you might find the other dashboards that are included useful.
Thanks!
Thanks for the update @AjmalFarzam
I can see details (& blogs of course) on the new ways logs & metrics are available in Viya 4. My question is really about the continuation of the MAS metrics currently available in Viya 3.5, rather than the generic JVM-type ones that are generated. I've already been using a tweaked version of this, for instance, to monitor multiple Viya 3.5 microservices.
These MAS metrics are in addition to those that come out of the other microservices:
# HELP mas_module_execution_seconds
# TYPE mas_module_execution_seconds summary
mas_module_execution_seconds_count{module="echo",tenant="provider",} 6.0
mas_module_execution_seconds_sum{module="echo",tenant="provider",} 0.066334917
# HELP mas_module_execution_seconds_max
# TYPE mas_module_execution_seconds_max gauge
mas_module_execution_seconds_max{module="echo",tenant="provider",} 0.0
# HELP mas_module_score_max describes how scores are distributed
# TYPE mas_module_score_max gauge
mas_module_score_max{module="echo",tenant="provider",} 0.0
# HELP mas_module_score describes how scores are distributed
# TYPE mas_module_score summary
mas_module_score_count{module="echo",tenant="provider",} 2.0
mas_module_score_sum{module="echo",tenant="provider",} 300.0
# HELP mas_module_availability
# TYPE mas_module_availability gauge
mas_module_availability{module="lmedt22s6fzaztjs5unrc2ryfzq",tenant="provider",} 1.0
mas_module_availability{module="hmeq_value",tenant="provider",} 1.0
mas_module_availability{module="dcm_treatments_c2eca78a_2624_4744_9ffd_e238eec02ec2",tenant="provider",} 1.0
mas_module_availability{module="logistic_regression_06e7bffb_89f3_4775_8bd8_029793970f91",tenant="provider",} 1.0
mas_module_availability{module="sql_executer",tenant="provider",} 1.0
mas_module_availability{module="echo",tenant="provider",} 1.0
# HELP mas_modules_availability
# TYPE mas_modules_availability gauge
mas_modules_availability{tenant="provider",} 1.0
# HELP mas_core_memoryused
# TYPE mas_core_memoryused gauge
mas_core_memoryused 8.33036288E8
These are invaluable in creating dashboards to show how MAS is performing in real time. Are these continued into Viya 4? They have gone from the online docs as far as I can tell.
Regards
Alan