SAS customers appreciate SAS® Grid Manager for the many benefits that it provides to their SAS environment. Enhancing the availability of their critical services is certainly one of the most popular benefits.
Starting with SAS 9.4M5, SAS Grid Manager for Platform includes LSF 10.x, which provides new capabilities in the realm of high availability. In this post I'll cover one of them: custom service monitoring.
If you are an experienced grid administrator, you may be used to editing grid configuration files by hand, or to using RTM as a web interface to manage your grid.
There is a third way; you can use the SAS Grid Manager module for SAS Environment Manager to manage and configure resources on a SAS grid, including high-availability applications.
First introduced with SAS 9.4M2 with a limited set of capabilities, it has evolved over subsequent releases to become the tool of choice for grid administrators.
The home page of the SAS Grid Manager module. The four tiles group its capabilities.
The SAS Global Forum paper SAS® Grid Administration Made Simple describes in detail how to use the SAS Grid Manager Module for SAS Environment Manager, including the HA Configuration Manager.
When you log on with the LSF Administrator user credentials, you can use the HA Configuration Manager module to create and maintain HA configurations, such as sets of definitions that describe how to start, stop, and monitor services.
Starting with SAS 9.4M5, the HA Configuration Manager supports new capabilities delivered by LSF 10, such as custom service monitoring.
When you define a highly available service, you have to specify on its configuration page the scripts used to start and stop the service. The grid uses these scripts to control the service on your behalf.
The SAS 9.4M5 release includes support for an optional third script, which is used to monitor the health status of the service.
The new health check script parameters on the Execution Settings page
The purpose of this script is to periodically check the health of the service and report its status back to the grid.
Here are the supported status codes that the script can output and their meaning:
| Status code | Meaning | How the grid reacts |
|---|---|---|
| TENTATIVE | The service is still starting or initializing. | Wait. |
| READY | The service is up and running. | Mark the service as ACTIVE and, if configured, start dependent services. |
| ERROR | The service has failed. | Issue the service stop command, then try to restart it. |
| END | The monitor script has exited and will no longer monitor the service. | The service remains in its current state. |
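To make this contract concrete, here is a minimal sketch of what a health-check script for a generic, port-based service might look like. Everything in it is illustrative: the host, port, and PID file are made-up placeholders, and I am assuming the script reports its status by writing one of the keywords above to standard output. Check the LSF documentation for the exact contract expected by your release.

```bash
#!/bin/bash
# Minimal health-check sketch for a hypothetical HA service.
# Assumption: the grid reads one of the status keywords from standard output.

SERVICE_HOST=localhost            # host where the monitored service runs (placeholder)
SERVICE_PORT=12345                # port the service should be listening on (placeholder)
PIDFILE=/var/run/myservice.pid    # PID file written by the start script (placeholder)

# No process at all: report a failure so the grid stops and restarts the service.
if [ ! -f "$PIDFILE" ] || ! kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "ERROR"
    exit 0
fi

# The process exists: check whether it is already accepting connections.
if (exec 3<>"/dev/tcp/$SERVICE_HOST/$SERVICE_PORT") 2>/dev/null; then
    echo "READY"        # up and answering: the grid can mark the service ACTIVE
else
    echo "TENTATIVE"    # process is up but still initializing: keep waiting
fi
exit 0
```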
This new capability can be more effective at detecting services that have become unresponsive, but it might cause an unwanted failover if the max update interval parameter is not properly configured. If, for example, the monitored service is busy servicing user requests, it can simply take longer than usual to respond to status requests. Is this a failure, or should the time-out simply be increased? That is your call to make as a grid administrator. What's important is that the software gives you the tools to implement whatever you decide.
Not only does this new capability provide better monitoring, it also helps with managing service dependencies.
The following diagram compares service statuses with the traditional grid HA configuration versus using a custom monitoring script.
Blue arrows: startup sequence. Red arrows: error detected. Black arrows: after issuing the stop command.
To better understand what this means, let's walk through a real use case. An Object Spawner requires a Metadata Server, so, after defining both as HA services in your grid, you decide to set a dependency so that the grid starts the Object Spawner only after the Metadata Server is active.
With the traditional configuration, the grid marks the Metadata Server as ACTIVE as soon as its process is up and running. The problem is that the Metadata Server may still be initializing and not yet listening for incoming connections. The grid does not know this and immediately tries to start the Object Spawner, which obviously fails. There are workarounds for this issue, such as modifying the Object Spawner startup script to include a wait or a check on the Metadata Server, or using external orchestration tools or scripts instead of configuring the service dependency within the grid.
A better solution comes with the new custom monitoring option.
You can write a custom monitoring script that simply calls the default "metadataserver status" script, which is aware of the actual Metadata Server state: it does not report the server as up until it has completed its initialization and is actually listening for connections. With such a custom monitoring script, you tell the grid when the service is truly ACTIVE, so the grid starts the dependent Object Spawner only when it's really time to.
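As a hedged sketch, and under the assumptions that the deployment path below matches your environment and that the status command exits with zero when the server is up (verify both before relying on this), such a wrapper could be as simple as the following.

```bash
#!/bin/bash
# Hypothetical monitoring wrapper for a highly available SAS Metadata Server.
# Assumption: deployment-specific path; adjust the level and install location.
METADATA_SCRIPT=/opt/sas/config/Lev1/SASMeta/MetadataServer/MetadataServer.sh

# Ask the standard script for the server status. It reports the server as up
# only after initialization has completed and it is listening for connections.
# Assumption: a zero exit code means "up"; verify this in your environment.
if "$METADATA_SCRIPT" status >/dev/null 2>&1; then
    echo "READY"        # fully initialized and listening: the grid can mark it ACTIVE
else
    echo "TENTATIVE"    # not answering yet (still starting): keep waiting
fi
exit 0
```

A production version would typically also distinguish a server that is still starting from one that has failed, and return ERROR in the latter case so that the grid attempts a restart.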
You can find more details, including an example of a custom monitoring script, in the SAS Global Forum paper, SAS® Grid Manager for High Availability: Implementation Best Practices.