Author's note: Updated to include the required changes for the SAS Connect Spawners, the SAS Web Infrastructure Platform (WIP) databases and some considerations for Azure.
When you get SAS Grid Manager, it promises to deliver workload balancing, high-availability and faster processing in a flexible, centrally managed grid computing environment.
Until now, we have seen how SAS implemented SAS Grid Manager for Hadoop, and SAS Grid Manager for Platform with the IBM Platform Suite technology (LSF and EGO, as they are popularly known). Now with SAS 9.4 M6, there is an exciting brand new distribution, SAS Grid Manager, which features SAS Workload Orchestrator (SWO) and SAS Job Flow Scheduler (SJFS).
SAS Workload Orchestrator is new and, as such, can take a bit longer to implement than SAS Grid Manager for Platform, a more mature product that has been running for years. However, you might get involved in working on it, simply because it was decided at other levels, because it is an easy one to implement and install, or just because you are a brave one! In any case, as it is a relatively new product, some details are still being outlined.
In this article, I will share an example of how to implement High Availability and Automatic Fail-over for one of the most common services, the Spawners. We will start with the Object Spawner, then I will explain the differences for the Connect Spawners. It is a given, of course, that you will need a working installation of SAS Grid Manager on Linux in place, and that you should know the basics.
Let us go into details.
In the links at the bottom of this article, you will find the SAS documentation related to the High Availability configuration details for SAS Workload Orchestrator.
Start with the sample script provided by SAS. Then, in its first block, configure the details, including the service you want to make highly available and its start parameter. I named this script objspwn_utf8_ha.sh:
name="arbitrary_name_of_your_sas_service"
script=`basename "$0"`
now=`date +%Y-%m-%d@%H:%M:%S`
command="/path_to_sasconfig/Lev1/ObjectSpawner/ObjectSpawner.sh start"
thisHost=`hostname`
log_filename="/tmp/sas/swo/${name}.${thisHost}.log"
pid_filename="/tmp/sas/swo/${name}.${thisHost}.pid"
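The log and PID files above live under /tmp/sas/swo, and on many Linux systems /tmp is cleared at reboot. In case the sample script does not already create that directory, here is a minimal sketch of how it could be done. The variable swo_work_dir is my own, hypothetical addition and is not part of the SAS sample script; the other names are the ones used above.

```shell
#!/bin/sh
# Sketch only: make sure the working directory for the HA script exists.
# swo_work_dir is a hypothetical variable, not part of the SAS sample script.
name="arbitrary_name_of_your_sas_service"
thisHost=$(hostname)
swo_work_dir="${swo_work_dir:-/tmp/sas/swo}"
log_filename="${swo_work_dir}/${name}.${thisHost}.log"
pid_filename="${swo_work_dir}/${name}.${thisHost}.pid"

# mkdir -p succeeds silently if the directory is already there.
mkdir -p "$swo_work_dir" || {
    echo "Cannot create ${swo_work_dir}" >&2
    exit 1
}
```

Keeping the directory in a single variable also makes it easier to move these files out of /tmp later, if you prefer a location that survives reboots.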
Once done, you should be able to call this script with the start, stop, status and restart parameters with successful results. It is, however, a process. Wait for it...
Then, in SWO, create the service:
Then save the changes and check status in the Services area, where you can also start the HA service or stop/disable it.
I don't know about you, but I found this configuration to be fairly simple - definitely much easier than with LSF/EGO!
But, wait, that is not enough. If you pay attention, in the Services area, the new HA service will stay as ADDED, when it should be RUNNING. You need to take care of a few things first.
One of the reasons why the sample script will not work out-of-the-box is because it is just that, a sample script.
If you pay attention to the script, it is all based on working with the PID of the process you execute. More specifically:
nohup $command > $log_filename 2>&1 &
pid=$!
Considering the command you are executing is "ObjectSpawner.sh start", "pid=$!" will try to capture the PID of ObjectSpawner.sh.
Unfortunately, ObjectSpawner.sh is not the real Object Spawner process that will be running in our SAS environment. It is only a wrapper that does a few other things and then calls the real one.
This means this PID is not useful: you need the PID of the final Object Spawner process. Luckily, ObjectSpawner.sh creates a PID file which contains the PID number you need:
eval "nohup $COMMAND $CMD_OPTIONS -sasSpawnerCn \"$SPWNNAME\" -xmlconfigfile $OMRCFG -logconfigloc $CONFIGDIR/logconfig.xml ${USERMODS}> $LOGSDIR/ObjectSpawner_console_${HOSTNAME}.log 2>&1 &"
pid=$!
echo $pid > $CONFIGDIR/$SERVER_PID_FILE_NAME
Wonderful!
This allows you to make use of this PID file in your sample script. The main consideration here is that the script can read the Object Spawner PID file but, under no circumstances, should it write to it. Denied.
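To illustrate the "read only" idea, here is a minimal sketch of a status check that reads a PID file and tests whether that process is alive. The function name pid_file_alive is hypothetical and is not part of the SAS sample script. It relies on kill -0, which sends no signal and only probes the process, so it assumes the script runs as the same user as the spawner (or as root).

```shell
#!/bin/sh
# Sketch only: hypothetical helper, not part of the SAS sample script.
# Returns 0 if the PID stored in the given file belongs to a live process.
# The PID file is only read, never written.
pid_file_alive() {
    pid_file="$1"

    # The file must exist and be readable.
    [ -r "$pid_file" ] || return 1

    # Keep only the first line, in case the file holds more than the PID.
    candidate_pid=$(head -n 1 "$pid_file")

    # Reject empty or non-numeric content.
    case "$candidate_pid" in
        ''|*[!0-9]*) return 1 ;;
    esac

    # kill -0 sends no signal; it only checks that the process can be signalled.
    kill -0 "$candidate_pid" 2>/dev/null
}

# Example call (path as used in this article):
# pid_file_alive "/path_to_sasconfig/config/Lev1/ObjectSpawner/server.$(hostname).pid" && echo "running"
```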
How can you do it? Here is my current implementation. If you have better approaches, go ahead and post your proposed modifications in the comments below.
Remember the nohup command and the pid=$! of those 2 lines? Well, you will comment out the pid=$! and, below the nohup command, include the block surrounded by "# Modify to pick up the PID generated by the Object Spawner" and "# End of block":
nohup $command > $log_filename 2>&1 &
#pid=$!
#echo $pid > $pid_filename
# Modify to pick up the PID generated by the Object Spawner
sleep 1
spwn_pid_filename=/path_to_sasconfig/config/Lev1/ObjectSpawner/server.${thisHost}.pid
spwn_pid=`cat $spwn_pid_filename`
echo $spwn_pid > $pid_filename
# End of block
echo "${now} ${script}: Service ${name} (pid $spwn_pid) is started"
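The fixed "sleep 1" works for me, but on a slow or busy node the spawner may need longer to write its PID file. A minimal sketch of an alternative is to poll for the file with a timeout. The function name wait_for_pid_file and its default of 10 seconds are hypothetical, and the sketch assumes that a stale PID file from a previous run is not left behind, as it would be picked up immediately.

```shell
#!/bin/sh
# Sketch only: hypothetical helper, not part of the SAS sample script.
# Polls until the given PID file exists and is not empty.
# $1 = PID file to wait for
# $2 = maximum number of seconds to wait (default 10, an arbitrary choice)
wait_for_pid_file() {
    wanted_file="$1"
    max_wait="${2:-10}"
    waited=0

    while [ "$waited" -lt "$max_wait" ]; do
        # -s is true when the file exists and has a size greater than zero.
        if [ -s "$wanted_file" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done

    # Final check after the last sleep; its result is the return code.
    [ -s "$wanted_file" ]
}
```

In the modified block, the "sleep 1" line could then become something like wait_for_pid_file "$spwn_pid_filename" 10, followed by the same cat and echo lines.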
Remember that in the first step, you could not validate the script with the status, stop, start and restart? Well, now you should be able to. Go ahead and test it now.
Now, in theory, if all parameters are OK in SWO, you should be able to start the HA service or stop it from the SWO GUI, and SWO will capture the status all the time.
You might need to validate that the magic is actually happening at all levels. Please check manually that you can validate the following actions:
If all above conditions are true for you ... You are good to go!
More good news. When you plan to implement high availability for your SAS Connect Spawners, there are not many changes to take into account in those customizations. The reason for this is that the ConnectSpawner.sh wrapper created by SAS closely follows the implementation of ObjectSpawner.sh:
eval "nohup $COMMAND $CMD_OPTIONS -sasSpawnerCn \"$SPWNNAME\" -xmlconfigfile $OMRCFG -logconfigloc $CONFIGDIR/logconfig.xml ${USERMODS}> $LOGSDIR/ObjectSpawner_console_${HOSTNAME}.log 2>&1 &"
pid=$!
echo $pid > $CONFIGDIR/$SERVER_PID_FILE_NAME
Thanks to this implementation, we will be able to capture the PID following exactly the same method as described above, for the SAS Object Spawner.
A summary:
1. You can copy one of the customized sample scripts that we created for the Object Spawners.
2. Customize it for the Connect Spawner:
name="arbitrary_name_of_your_sas_connect_service"
command="/path_to_sasconfig/Lev1/ConnectSpawner/ConnectSpawner.sh start"
spwn_pid_filename=/opt/sas/comp/sasconfig/Lev1/ConnectSpawner/server.${thisHost}.pid
3. Test the script
4. Create the HA service in SWO
5. Test the script with SWO GUI
That is mainly all we need. Once you know how to do it once, it is actually a fairly simple implementation!
The webinfdsvrc.sh script for the SAS WIP database is another wrapper for the actual service, which means we will be able to use a method similar to the earlier one, but with a couple of extra considerations:
A. The WIP database method for high availability / clustering / fail-over is Active-Passive. It should run on only one node at a time, otherwise we run the risk of corrupting our database.
B. The sample script for the HA service will need a couple of extra changes. The PID file is not exactly like the one for the Spawners, and we will also need an extra "sleep" command.
Having those two considerations in mind, let me go a bit further in detail. Following similar guidelines as for the Spawners:
1. You can copy one of the customized sample scripts that we created for the Object Spawners.
2. Customize it for the WIP database:
name="arbitrary_name_of_your_sas_wip_service"
command="/path_to_sasconfig/Lev1/WebInfrastructurePlatformDataServer/webinfdsvrc.sh start"
3. One more customization for the WIP database:
##**********************************************************
## Start the Service
##**********************************************************
[lines of code]
nohup $command > $log_filename 2>&1 &
# Modify to pick up the PID generated - Juan Sanchez
#pid=$!
#echo $pid > $pid_filename
sleep 1
spwn_pid_filename=/opt/sas/comp/sasconfig/Lev1/WebInfrastructurePlatformDataServer/data/postmaster.pid
# postmaster.pid holds several lines; the PID is on the first one
spwn_pid=`head -n 1 $spwn_pid_filename`
echo $spwn_pid > $pid_filename
#
echo "${now} ${script}: Service ${name} (pid $spwn_pid) is started"
4. Make a good backup of your WIP database!
5. Test the script
6. Create the HA service in SWO. Two considerations:
a. Important: for Active-Passive clusters, set "Number of instances" to 1.
b. Disclaimer for Azure and for any environment with a Load Balancer that will not allow connections from one node to itself:
7. Test the script with SWO GUI
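A note on the PID pickup in step 3: PostgreSQL writes several lines to postmaster.pid, with the process ID on the first line and the data directory on the second, so only the first line should end up in the HA PID file. A minimal sketch of a helper that does this follows. The function name read_postmaster_pid is hypothetical and is not part of the SAS sample script.

```shell
#!/bin/sh
# Sketch only: hypothetical helper, not part of the SAS sample script.
# Prints the PID stored on the first line of a PostgreSQL postmaster.pid file.
# Returns 1 if the file is unreadable or the first line is not a number.
read_postmaster_pid() {
    postmaster_file="$1"

    [ -r "$postmaster_file" ] || return 1

    # Only the first line is the PID; the remaining lines are ignored.
    first_line=$(head -n 1 "$postmaster_file")

    case "$first_line" in
        ''|*[!0-9]*) return 1 ;;
    esac

    echo "$first_line"
}

# Example call (path as used in this article):
# read_postmaster_pid /opt/sas/comp/sasconfig/Lev1/WebInfrastructurePlatformDataServer/data/postmaster.pid
```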
If you have more databases from the multiple SAS solutions, you will be able to convert them into highly available services in no time with this approach!
Please note: SAS documents how to provide HA for WIP in https://support.sas.com/resources/papers/Managing-WIP-DataServerforHA.pdf . However, the present guide does not follow this approach.
For those it might help, the reasoning behind this decision was: a) this implementation is easier and shorter; b) the documented method still has a single point of failure (SPOF) in the pgpool service; c) as it works in Master-Slave mode, the documented approach requires a WIP database per Grid node, so either those databases need large enough local storage, or you place them in the shared storage, which adds workload to it, decreases performance and consumes significant space.
In my particular case, this was not quite enough for the first implementation of my first Object Spawner (I have 6 of them), but this was due to a typo I made in the username: a space at the end, which the SWO GUI did not detect.
I managed to resolve it with a bit of troubleshooting which I will describe in my next article, thanks and kudos to @doug_sas for his support. I will refer to it in this article once the next one is published.
Once I had implemented and validated the first one, I could enable and validate the rest of the Object Spawners in a matter of minutes. If you just copy your validated script and make a few modifications, it will make your life easier. E.g.:
cp /sas_application/sasdata/sasadmin/swo_scripts/objspwn_utf8_ha.sh /sas_application/sasdata/sasadmin/swo_scripts/objspwn_latin1_ha.sh
Then a few modifications to the script:
name="sas_obj_spwn_latin1_ha_svc"
command="/opt/sas/comp/sasconfig/Lev1/ObjectSpawnerLatin1/ObjectSpawner.sh start"
spwn_pid_filename=/sas_application/sasconfig/comp/config/Lev1/ObjectSpawnerLatin1/server.${thisHost}.pid
Of course, this change can be automated with a bit of prep work. It is even easier if you create variables for the path and for the Object Spawner name as it appears in the file system.
After that, the creation of the HA service in SWO is as easy as described earlier in this article.
As we could see, we can use the same approach for the Object Spawners and the Connect Spawners, and a very similar one for the WIP database, with a couple of extra considerations.
As closing remarks, and as you could see above, it is important to consider the provided sample script as just a sample, an initial guide. When you want to implement HA for a service, you will need to do a little exploration of the architectural particularities of that service, to be able to integrate it with the sample script and SAS Workload Orchestrator. You want to check that the correct PID is recognized and that the start, stop, status and restart functions behave as expected. Therefore, a bit of SAS knowledge and Linux scripting is required, not to mention a lot of curiosity.
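As a minimal sketch of that idea, the three customized lines can be derived from a couple of variables. The names sas_config_root, spawner_dir and spawner_script are hypothetical, and the sketch assumes that the wrapper script and its PID file live in the same directory. If they do not in your installation, use a second variable for the PID file location.

```shell
#!/bin/sh
# Sketch only: hypothetical variables, adjust them to your installation.
# Assumption: the wrapper script and its PID file share the same directory.
sas_config_root="${sas_config_root:-/path_to_sasconfig/Lev1}"
spawner_dir="${spawner_dir:-ObjectSpawnerLatin1}"
spawner_script="${spawner_script:-ObjectSpawner.sh}"
thisHost=$(hostname)

# The three values that change from one spawner to the next.
name="sas_${spawner_dir}_ha_svc"
command="${sas_config_root}/${spawner_dir}/${spawner_script} start"
spwn_pid_filename="${sas_config_root}/${spawner_dir}/server.${thisHost}.pid"

echo "name=${name}"
echo "command=${command}"
echo "spwn_pid_filename=${spwn_pid_filename}"
```

With this in place, a new spawner only needs a different spawner_dir (and spawner_script, for a Connect Spawner) instead of three hand-edited lines.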
With this guideline we have covered the most critical services of the SAS Compute tier, by using the SAS Workload Orchestrator, our new SAS Grid Manager: the Object Spawners, the Connect Spawners and the WIP database.
In addition to those, we would need to set up highly available services for the scheduling services, such as SAS Launcher and SAS Job Flow Scheduler. However, at the time of writing this article, HA is not supported for SWO's scheduling services. On the other hand, SAS R&D is currently working on it, and I will update this article as soon as I have news on this topic. For now, you can make use of any other scheduler tool you have at hand.
I hope this can be useful for future implementations of highly available services with SAS Workload Orchestrator.
Please do not hesitate to contact me or share your comments below! I would love to learn from your experiences and implementations.
Related links, not SAS Communities related:
Next articles to come: