Children of the Routing Tree: Alert Notifications in SAS Viya


Prometheus has a rather interesting origin story. It was built in-house at SoundCloud to address their specific production monitoring needs, and later became a Cloud Native Computing Foundation project alongside Kubernetes. Both have since been widely adopted as production-grade enterprise software, and both play key roles in the next generation of SAS Viya. In an earlier article, we introduced the alerting facility that Prometheus provides as part of the SAS Viya Monitoring for Kubernetes framework. We looked at the mechanics and saw how alerts can be created from the metrics accumulated by Prometheus and passed on to AlertManager. In this post, we'll take a deeper dive into alerts, with a focus on the alert routing process.

What is routing?

What happens when defined alert conditions are met and alerts start firing? Routing is the process of deciding which paths a firing alert takes, and therefore what happens to it (think "escalation schemes" for resource alerts in SAS 9). The example we saw in the previous post was a typical one: sending an email to an administrator when an alert condition is met. The routing configuration instructs AlertManager to evaluate each alert against a number of routes, so that every alert is assessed individually and an action relevant to that particular alert is taken.

Where is routing defined?

Routing is defined in the form of a 'routing tree' in the AlertManager configuration. There is a default route (the root route), and there can be zero or more child (nested) routes. An alert enters the routing tree at the top and is evaluated against each route. When it matches and enters a route, it finds the receiver for that route, which defines the action to be performed. Depending on how the routing tree is configured, it can then either exit the tree and conclude the routing process, or continue traversing the tree looking for more routes it can enter (refer to the continue attribute in the diagram below).
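As a rough illustration, a routing tree of this shape might be written as follows in the AlertManager configuration. This is a minimal sketch rather than a reproduction of the diagram: the receiver names default-receiver and sas-dev-team are hypothetical, while the label filters and the sas-prod-admins receiver follow the example discussed later in this post. It assumes an AlertManager release recent enough to support the matchers syntax.

```yaml
# Fragment of an AlertManager configuration (route section only).
route:
  # Root route: every alert matches here. Its receiver is used
  # when no child route matches the alert.
  receiver: default-receiver          # hypothetical receiver name
  routes:
    # First child route: entered only by alerts labelled team=sasdevs.
    - matchers:
        - team="sasdevs"
      receiver: sas-dev-team          # hypothetical receiver name
      # With continue: true, the alert is also tested against the
      # sibling routes that follow instead of stopping here.
      continue: true
    # Second child route: entered only by alerts labelled severity=critical.
    - matchers:
        - severity="critical"
      receiver: sas-prod-admins
      # continue is omitted, so it takes its default value of false.
```

With this layout, an alert that matches neither child route is handled by the root route's receiver.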


What is a receiver?

A receiver is simply the destination where a firing alert ends up after entering a route. Each route in the routing tree has an associated receiver. When a firing alert is sent to a route, the receiver for that route is instructed to carry out a pre-determined action, such as sending an email notification. An alert might match one or more routes and could be sent to one or more receivers (for example, a firing alert could send an email to the SAS administrator as well as a Telegram message to the Kubernetes administrator).
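To make this concrete, receivers for the email and Telegram case might be declared along the following lines. This is only a sketch under stated assumptions: the SMTP host, addresses, bot token, chat ID, and the k8s-admins receiver name are all hypothetical placeholders, and the Telegram integration assumes an AlertManager version that includes telegram_configs.

```yaml
# Fragment of an AlertManager configuration (global and receivers sections).
global:
  # SMTP settings shared by all email receivers (placeholder values).
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"

receivers:
  # Receiver for the SAS administrator: sends an email.
  - name: sas-prod-admins
    email_configs:
      - to: "sas-admin@example.com"       # placeholder address
  # Receiver for the Kubernetes administrator: sends a Telegram message.
  - name: k8s-admins                      # hypothetical receiver name
    telegram_configs:
      - bot_token: "<telegram-bot-token>" # placeholder, supply your own
        chat_id: 123456789                # placeholder chat ID
```

Each name here is what a route refers to in its receiver field.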

I'll try my best to explain with a football analogy. The quarterback (AlertManager) receives the ball (a firing alert) from a snap and either hands it off to a running back to make a run, or throws it to make a pass; these are the routes an alert can take. The players receiving the ball are the receivers; each route has one, and when the ball is delivered to them, they perform a pre-defined task (run towards the endzone and try not to die!). In AlertManager though, each receiver can perform a different action, and an alert can pass through multiple routes. To control how alerts traverse through the routing tree and get to receivers, we use alert labels.

What are labels and how are they used?

Labels are simply descriptive tags that can be added to an alert rule when it is created (note that some labels, such as namespace, pod, or service, are generated automatically from the Prometheus metrics). When the alert fires, it enters the routing tree at the root. All alerts match the root route. The alert then navigates through the tree and may match and enter none, one, or several of the child routes. Whether an alert enters a child route is determined by the label filters defined for that route. In the example above, the child routes each have a label filter: firing alerts will only match the first child route if they are labelled with team: sasdevs, and will only match the second route if they are labelled with severity: critical. If an alert has both of these labels, it goes to both receivers, provided the first child route sets continue to true (otherwise routing stops at the first matching child route). Automatic labels such as namespace: viya-prod can also be used, and matching can also be defined using regular expressions. Refer to the documentation for more information.
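A label filter that combines an automatic label with a regular expression could be sketched as follows. The root receiver name default-receiver is a hypothetical placeholder, and the severity values are assumed for illustration; the matchers syntax again assumes a recent AlertManager release.

```yaml
# Fragment of an AlertManager configuration (route section only).
route:
  receiver: default-receiver              # hypothetical receiver name
  routes:
    # All matchers on a route must hold for an alert to enter it.
    - matchers:
        - namespace="viya-prod"           # exact match on an automatic label
        - severity=~"warning|critical"    # regular-expression match
      receiver: sas-prod-admins
```

Here an alert enters the child route only if it comes from the viya-prod namespace and its severity is either warning or critical.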

If, for instance, the alert rule looks like the example below, it will match the second child route and be sent to the sas-prod-admins receiver. As a matter of good practice, it is best to decide first on some standard labels to use when creating alert rules (e.g. severity, team name, or environment).
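An alert rule carrying the severity: critical label might look something like the sketch below, written in the standard Prometheus rule-group format. The group name, alert name, expression, and annotation are hypothetical placeholders, not the rule from the original post; only the severity label is taken from the discussion above. In a Prometheus Operator deployment, the same groups would typically sit under the spec of a PrometheusRule resource.

```yaml
groups:
  - name: example-viya-rules              # hypothetical group name
    rules:
      - alert: ExampleTargetDown          # hypothetical alert name
        # Placeholder condition: a scrape target in viya-prod is down.
        expr: 'up{namespace="viya-prod"} == 0'
        # The condition must hold for 5 minutes before the alert fires.
        for: 5m
        labels:
          # This label is what the second child route filters on.
          severity: critical
        annotations:
          summary: "A scrape target in viya-prod has been down for 5 minutes"
```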

What other actions can a firing alert trigger?

Receivers can send emails, or they can use webhooks to send notifications (via an HTTP POST) in other ways. For example, a message can be sent to an MS Teams channel, or a text message or push notification can be sent with tools like PagerDuty. Instructions for configuring these connections are provided in the AlertManager documentation. When an administrator or engineer receives these notifications, some manual triage and remediation is typically required. However, this can be elevated from a manual, reactive process to one where actions are taken automatically. There are a number of tools, many of them open source, that automatically execute tasks when an alert fires. For example, a firing alert could trigger a CLI command, the automatic creation of a ServiceNow ticket, or the execution of a Jenkins pipeline. Check out prometheus-alert-webhooker and alertmanager-webhook-servicenow for some examples.
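A webhook receiver that hands alerts to an automation tool could be declared roughly as follows. The receiver name and URL are hypothetical; the endpoint stands in for whichever webhook listener is deployed, and its path and port depend entirely on that tool.

```yaml
# Fragment of an AlertManager configuration (receivers section only).
receivers:
  - name: automation-webhook              # hypothetical receiver name
    webhook_configs:
      # AlertManager sends an HTTP POST with a JSON payload describing
      # the alerts to this URL (placeholder endpoint).
      - url: "http://alert-hook.example.internal:8080/alerts"
        # Also notify the endpoint when the alert is resolved.
        send_resolved: true
```

A route then points at this receiver by name, in the same way as for an email receiver.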

More information

Check out my other articles that discuss other aspects of alert management.

Thanks for reading. I hope the information provided has been helpful. Leave a comment below to ask questions or share your own experiences.

Find more articles from SAS Global Enablement and Learning here.