High Availability with SAS Grid Manager 9.4M5: New Possibilities With Custom Service Monitoring

SAS customers appreciate SAS® Grid Manager for the many benefits that it provides to their SAS environment. Enhancing the availability of their critical services is certainly one of the most popular benefits.

Starting with SAS 9.4M5, SAS Grid Manager for Platform includes LSF 10.x, which provides new capabilities in the realm of high availability:

custom service monitoring via a health check script
enhanced management of service dependencies
application affinity

In this post I'll cover the first point, custom service monitoring.

SAS Grid Manager module for SAS Environment Manager

If you are an experienced grid administrator, you may be used to editing grid configuration files manually, or to using RTM as a web interface to manage your grid.

There is a third way; you can use the SAS Grid Manager module for SAS Environment Manager to manage and configure resources on a SAS grid, including high-availability applications.

First introduced in SAS 9.4M2 with a limited set of capabilities, the module has evolved over successive releases to become the tool of choice for grid administrators.

The home page of the SAS Grid Manager module. The four tiles group its capabilities.

The SAS Global Forum paper SAS® Grid Administration Made Simple describes in detail how to use the SAS Grid Manager Module for SAS Environment Manager, including the HA Configuration Manager.

HA Configuration Manager

When you log on with the LSF Administrator user credentials, you can use the HA Configuration Manager module to create and maintain HA configurations. For example, these could include sets of definitions of how to start, stop, and monitor services.

Starting with SAS 9.4M5, the HA Configuration Manager supports new capabilities delivered by LSF 10, such as:

custom service monitoring via a health check script
enhanced management of service dependencies
application affinity

Custom Service Monitoring

When you define a highly available service, you specify on the configuration page the scripts that start and stop the service. The grid uses these scripts to control the service on your behalf.

The SAS 9.4M5 release includes support for an optional third script, which is used to monitor the health status of the service.

The new health check script parameters on the Execution Settings page

The health check script field works as follows:

If you leave the field empty, the grid behaves just like it did in the previous releases. It monitors the service at the operating-system level (i.e., it asks the operating system whether the main process of the service is still there). Although this method is effective in detecting processes that die or machines that fail, it cannot detect the case in which a service is still there but has become unresponsive.
If you specify a script, the grid runs it immediately after starting the service. The script keeps running and is expected to output a status code once every N seconds, as specified in the max update interval field.

Here are the supported status codes that the script can output and their meaning:

Status code	Meaning	How the grid reacts
TENTATIVE	The service is still starting or initializing.	Wait.
READY	The service is up and running.	Mark the service as ACTIVE and, if configured, start dependent services.
ERROR	The service has failed.	Issue the service stop command, then try to restart it.
END	The monitor script exits and will stop monitoring the service.	The service remains in the current state.
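As a rough illustration of the reporting loop described above, here is a minimal Python sketch. It assumes that the script reports by writing one status code per line to standard output; the article only says that the script "outputs" a status code, so check the LSF documentation for the exact mechanism. The names monitor, report, and probe are hypothetical, and probe stands for whatever check you implement for your own service.

```python
import sys
import time
from typing import Callable, Iterator, Optional, TextIO

# The four status codes from the table above.
TENTATIVE, READY, ERROR, END = "TENTATIVE", "READY", "ERROR", "END"


def monitor(probe: Callable[[], str], interval: float,
            max_reports: Optional[int] = None) -> Iterator[str]:
    """Yield one status code per probe call, `interval` seconds apart.

    Stops after END, or after `max_reports` codes when that limit is set.
    """
    sent = 0
    while max_reports is None or sent < max_reports:
        status = probe()
        yield status
        sent += 1
        if status == END:
            return
        time.sleep(interval)


def report(statuses: Iterator[str], out: TextIO = sys.stdout) -> None:
    """Write each status code on its own line (assumed output format)."""
    for status in statuses:
        out.write(status + "\n")
        out.flush()  # avoid buffering, so that each update goes out promptly
```

A script built this way would end with a call such as report(monitor(my_probe, interval=30)), where my_probe is your own function and the interval is shorter than the max update interval configured in the HA Configuration Manager.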

This new capability can be more effective in detecting services that have become unresponsive, but it might cause an unwanted failover if the max update interval parameter is not properly configured. If, for example, the monitored service is busy servicing user requests, it can simply take longer than usual to respond to status requests. Is this a case of failure, or should the time-out simply be increased? It is your call, as a grid administrator, to decide. What's important is that the software gives you the tools to implement whatever decision you make.
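To make the trade-off concrete, here is a small Python sketch of the arithmetic involved. It assumes that the grid treats a missing update within the max update interval as a failure, which is what the paragraph above implies but does not spell out. The function name and the margin parameter are hypothetical.

```python
def sleep_between_reports(max_update_interval: float, probe_timeout: float,
                          margin: float = 5.0) -> float:
    """Return how long a monitoring script may sleep between two probes.

    In the worst case one cycle lasts sleep + probe_timeout, and that total
    has to stay below max_update_interval by at least `margin` seconds.
    """
    pause = max_update_interval - probe_timeout - margin
    if pause <= 0:
        raise ValueError(
            "max update interval is too short for this probe time-out; "
            "a slow but healthy service could be reported as failed"
        )
    return pause
```

For example, with a max update interval of 60 seconds and a status request that is allowed up to 20 seconds, sleep_between_reports(60, 20) returns 35.0. If the service regularly needs longer than that to answer, either the interval has to grow or you accept that a slow response counts as a failure.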

Not only does this new capability provide better monitoring possibilities, it also helps with managing service dependencies.

The following diagram compares service statuses with the traditional grid HA configuration versus using a custom monitoring script.

Blue arrows: startup sequence. Red arrows: error detected. Black arrows: after issuing the stop command.

To better understand what this means, let's walk through a real use case. An Object Spawner requires a Metadata Server, so, after defining both as HA services in your grid, you decide to set a dependency so that the grid starts the Object Spawner only after the Metadata Server is active.

With the traditional configuration, the grid marks the Metadata Server as ACTIVE as soon as its process is up and running. The problem is that the Metadata Server may still be initializing and not yet listening for incoming connections. The grid does not know this and immediately tries to start the Object Spawner, which fails because it cannot connect. There are possible workarounds for this issue, such as modifying the Object Spawner startup script to include a wait or a check on the Metadata Server, or using external orchestration tools or scripts instead of configuring the service dependency within the grid.
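The "wait or check" workaround can be sketched in a few lines of Python. This is only an illustration, not the startup script shipped with SAS: the function name is hypothetical, and the caller has to supply the host and port of the Metadata Server (8561 is the usual default port, but verify it for your deployment). Note that an open port only shows that the server accepts TCP connections, not that it is fully initialized.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 300.0,
                  poll: float = 5.0) -> bool:
    """Return True once host:port accepts a TCP connection.

    Retries every `poll` seconds and returns False if nothing is listening
    by the time roughly `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a liveness check.
            with socket.create_connection((host, port), timeout=poll):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
```

A modified startup wrapper would call wait_for_port with the Metadata Server host and port, and launch the Object Spawner only if the call returns True.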

A better solution comes with the new custom monitoring option.

You can write a custom script that simply calls the default "metadataserver status" script, which is aware of the actual Metadata Server status: it does not return OK until the server has completed all its initialization and is actually listening. By using a custom monitoring script, you tell the grid when the service is really ACTIVE, so the grid starts the dependent Object Spawner only when the Metadata Server is ready to accept its connection.
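Here is one way such a wrapper could translate the result of a status command into the status codes listed earlier, as a Python sketch. It rests on two assumptions that you should verify: that the status script exits with 0 only when the server is fully up, and that a failed check before the first READY means "still starting" rather than "failed". Both function names are hypothetical, and the command to run (the path to your Metadata Server script followed by status) is left to the caller.

```python
import subprocess
from typing import Optional, Sequence


def run_status_command(command: Sequence[str],
                       timeout: float = 20.0) -> Optional[int]:
    """Run a status command and return its exit code.

    Returns None if the command cannot be started or does not finish
    within `timeout` seconds.
    """
    try:
        result = subprocess.run(command, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.returncode


def classify(exit_code: Optional[int], was_ready: bool) -> str:
    """Map an exit code to a status code for the grid.

    Assumption: exit code 0 means the server is up and listening.
    """
    if exit_code == 0:
        return "READY"
    # A failed check before the first READY is treated as "still starting";
    # after the service has been READY once, it is treated as a failure.
    return "ERROR" if was_ready else "TENTATIVE"
```

This sketch has no startup deadline: a server that never finishes initializing would be reported as TENTATIVE forever, so a real script may want to switch to ERROR after a grace period.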

You can find more details, including an example of a custom monitoring script, in the SAS Global Forum paper, SAS® Grid Manager for High Availability: Implementation Best Practices.