
How to create HA services with SAS Workload Orchestrator in Linux (SAS Grid Manager)

Started ‎03-05-2020 by
Modified ‎04-06-2020 by

Author's note: Updated to include the required changes for the SAS Connect Spawners and the SAS Web Infrastructure Platform (WIP) databases, plus some considerations for Azure.

 

SAS Grid Manager promises to deliver workload balancing, high availability and faster processing in a flexible, centrally managed grid computing environment.

 

Until now, we have seen how SAS implemented SAS Grid Manager for Hadoop and SAS Grid Manager for Platform, the latter with the IBM Platform Suite technology (LSF and EGO, as it is popularly known). Now, with SAS 9.4 M6, there is an exciting, brand-new distribution, SAS Grid Manager, which features SAS Workload Orchestrator (SWO) and SAS Job Flow Scheduler (SJFS).

 

SAS Workload Orchestrator is new and, as such, can take a bit longer to implement than SAS Grid Manager for Platform, a more mature product that has been running for years. However, you might end up working with it anyway: because it was decided at other levels, because it is easy to implement and install, or just because you are a brave one! In any case, as it is a relatively new product, some details are still being worked out.

 

In this article, I will share an example of how to implement high availability and automatic fail-over for one of the most common services, the Spawners. We will start with the Object Spawner, then I will explain the differences for the Connect Spawners. It is a given, of course, that you need a working installation of SAS Grid Manager on Linux in place and that you know the basics.

 

Let us go into details.

 

In the links at the bottom of this article, you will find the SAS documentation related to the High Availability configuration details for SAS Workload Orchestrator. 

 

1. Use the Sample Script (Object Spawner)

 

Start with the sample script provided by SAS. Then, in the first block, configure the details, including the service you want to make highly available and its start parameter. I named this script objspwn_utf8_ha.sh.

 

 

name="arbitrary_name_of_your_sas_service"
script=`basename "$0"`
now=`date +%Y-%m-%d@%H:%M:%S`
command="/path_to_sasconfig/Lev1/ObjectSpawner/ObjectSpawner.sh start"
thisHost=`hostname`
log_filename="/tmp/sas/swo/${name}.${thisHost}.log"
pid_filename="/tmp/sas/swo/${name}.${thisHost}.pid"

 

 

Once done, you should in principle be able to call this script with the start, stop, status and restart parameters and get successful results. It will not work just yet, though; it is a process. Wait for it...
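For orientation, a control script of this kind is usually a small dispatcher around those four actions. The skeleton below is only a sketch of that shape and is not the SAS sample script: the function names (svc_start, svc_stop, svc_status, dispatch), the PID file location and the sleep command standing in for the real service are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical skeleton of a start/stop/status/restart control script.
# NOT the SAS sample script: names, paths and the stand-in command are assumptions.

pid_filename="${pid_filename:-/tmp/demo_service.$$.pid}"   # illustrative location
command="${command:-sleep 60}"                             # stand-in for the real service

svc_status() {
    # "Running" means: a PID file exists and the PID in it is alive.
    [ -f "$pid_filename" ] && kill -0 "$(cat "$pid_filename")" 2>/dev/null
}

svc_start() {
    if svc_status; then return 0; fi      # already running, nothing to do
    nohup $command >/dev/null 2>&1 &
    echo $! > "$pid_filename"
}

svc_stop() {
    if svc_status; then kill "$(cat "$pid_filename")"; fi
    rm -f "$pid_filename"
}

dispatch() {
    case "$1" in
        start)   svc_start ;;
        stop)    svc_stop ;;
        status)  svc_status ;;
        restart) svc_stop; svc_start ;;
        *)       echo "Usage: $0 {start|stop|status|restart}" >&2; return 2 ;;
    esac
}

# Only dispatch when an action was passed on the command line.
if [ $# -gt 0 ]; then dispatch "$1"; fi
```

The status action reports through its exit code (0 when running), which is the simplest thing for a caller to test.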

 

2. Create the HA service in SAS Workload Orchestrator

 

Then, in SWO, create the service:

 

  • Give this service a name, e.g. sas_obj_spwn_utf8_ha_svc
  • Optionally select whether you want it to autostart when sgmg.sh starts
  • The "number of instances" sets the level of HA for this service and its automatic fail-over: active-active or active-passive. Check out the High Availability SAS papers and presentation slides referenced at the bottom of this article.
    • For the Object Spawner, an active-active model works best. You can put a value equal to the number of grid nodes, or lower
    • For active-passive models, such as the WIP database, a value of just "1" is enough
  • Point SWO to the location of the sample script you created earlier
  • Set the user and password that should run the script, by default, sasinst
  • And select those hosts where you want SWO to check if the service is running

 

[Screenshot: SWO_HA_objspwn.PNG]

 

Then save the changes and check the status in the Services area, where you can also start, stop or disable the HA service.

 

I don't know about you, but I found this configuration to be fairly simple - definitely much easier than with LSF/EGO!

 

But wait, that is not enough. If you pay attention, in the Services area the new HA service will stay as ADDED when it should be RUNNING. You need to take care of a few things first.

 

3. Customization to the Sample Script (Object Spawner)

 

One of the reasons why the sample script will not work out of the box is that it is just that, a sample script.

 

If you pay attention to the script, it is all based on working with the PID of the process you execute. More specifically:

 

 

nohup $command > $log_filename 2>&1 &
pid=$!

 

Considering the command you are executing is "ObjectSpawner.sh start", "pid=$!" will try to capture the PID of ObjectSpawner.sh.

 

Unfortunately, ObjectSpawner.sh is not the real Object Spawner process that will be running in our SAS environment. It is only a wrapper that calls the real one and does a few more things.

 

This means this PID is not useful; you need the PID of the final Object Spawner process. Luckily, ObjectSpawner.sh creates a PID file which contains the PID number you need:

 

 

eval "nohup $COMMAND $CMD_OPTIONS -sasSpawnerCn \"$SPWNNAME\" -xmlconfigfile $OMRCFG -logconfigloc $CONFIGDIR/logconfig.xml ${USERMODS}> $LOGSDIR/ObjectSpawner_console_${HOSTNAME}.log 2>&1 &"
         pid=$!
         echo $pid > $CONFIGDIR/$SERVER_PID_FILE_NAME

Wonderful! 
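If you want to see the difference between the wrapper's PID and the service's PID for yourself, it can be reproduced with a toy wrapper. This is only an illustrative demo under my own assumptions: wrapper.sh and server.pid are made-up stand-ins for ObjectSpawner.sh and its PID file, and sleep plays the part of the spawner.

```shell
#!/bin/sh
# Demo: $! of a wrapper is not the PID of the process the wrapper launches.
# wrapper.sh and server.pid are hypothetical stand-ins, created in a temp directory.
workdir=$(mktemp -d)

# Toy wrapper: launches the "real" service in the background, records its PID, exits.
cat > "$workdir/wrapper.sh" <<EOF
#!/bin/sh
sleep 30 &
echo \$! > "$workdir/server.pid"
EOF
chmod +x "$workdir/wrapper.sh"

"$workdir/wrapper.sh" &
wrapper_pid=$!                          # what pid=$! captures in the sample script
wait "$wrapper_pid"                     # the wrapper itself finishes right away
real_pid=$(cat "$workdir/server.pid")   # what the wrapper wrote for its child

echo "wrapper PID: $wrapper_pid (already gone)"
echo "service PID: $real_pid (still running)"

kill "$real_pid"                        # clean up the demo process
rm -rf "$workdir"
```

The two numbers differ, and only the second one is of any use for status checks.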

 

This allows you to make use of it in your sample script. The main consideration here is that the script can read the Object Spawner PID file but should under no circumstances write to it. Denied.

 

How can you do it? Here is my current implementation. If you have better approaches, go ahead and post your proposed modifications in the comments below.

 

Remember those two lines, the nohup command and the pid=$! after it? Well, comment out the pid=$! and, below the nohup command, include the block delimited by "# Modify to pick up the PID generated by the Object Spawner" and "# End of block":

 

 

nohup $command > "$log_filename" 2>&1 &
#pid=$!
#echo $pid > $pid_filename

  # Modify to pick up the PID generated by the Object Spawner
  sleep 1   # give ObjectSpawner.sh time to write its PID file
  spwn_pid_filename=/path_to_sasconfig/config/Lev1/ObjectSpawner/server.${thisHost}.pid
  spwn_pid=`cat "$spwn_pid_filename"`
  echo "$spwn_pid" > "$pid_filename"
  pid=$spwn_pid   # keep $pid set for the message below
  # End of block
    echo "${now} ${script}: Service ${name} (pid $pid) is started"
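A fixed one-second sleep works when the spawner writes its PID file quickly, but on a busy node it might not be enough, and a stale PID file from a previous run could be picked up by mistake. One possible alternative is to poll until the file names a live process. The helper below is a sketch of that idea: wait_for_pid_file is a hypothetical function, not part of the SAS sample script, and it assumes the HA script runs as the same user as the spawner (otherwise kill -0 is not allowed to probe the process).

```shell
#!/bin/sh
# Hypothetical helper: wait until a PID file exists and names a live process.
# Usage: wait_for_pid_file FILE [TIMEOUT_SECONDS]
# Prints the PID and returns 0 on success; returns 1 on timeout.
# Assumption: the caller may signal the process (same user or root).
wait_for_pid_file() {
    file=$1
    remaining=${2:-30}
    while [ "$remaining" -gt 0 ]; do
        if [ -s "$file" ]; then
            candidate=$(head -n 1 "$file")
            # kill -0 sends no signal; it only tests that the process exists.
            if kill -0 "$candidate" 2>/dev/null; then
                echo "$candidate"
                return 0
            fi
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}
```

In the block above, the sleep and cat lines could then become something like spwn_pid=$(wait_for_pid_file "$spwn_pid_filename" 30) || exit 1, with the timeout adjusted to your environment.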

 

Remember that in the first step you could not validate the script with status, stop, start and restart? Well, now you should be able to. Go ahead and test it.

 

4. See HA in action in SAS Workload Orchestrator!

 

Now, in theory, if all parameters are OK in SWO, you should be able to start the HA service or stop it from the SWO GUI, and SWO will capture the status all the time.

 

It is worth validating that the magic is actually happening at all levels. Please check the following manually:

 

  • You can start your service in the Services area, and its status changes from ADDED to RUNNING.
  • After that action, the Object Spawner has indeed started in the background; check it with ObjectSpawner.sh status.
  • If you stop or kill one of the Object Spawners on one of the grid nodes (ObjectSpawner.sh stop or sudo kill -9 PID), SWO automatically starts it again, which you can check from SWO and with ObjectSpawner.sh status.
  • You can stop your service from SWO, and a check from the shell confirms that the Object Spawners are stopped.

If all of the above conditions hold for you ... you are good to go!
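The kill-and-recover check in the list above can also be watched from the shell instead of refreshing the GUI. Here is a rough sketch: wait_for_restart is a hypothetical helper of mine that polls a PID file until it shows a new, live PID different from the one you killed. It assumes the restarted spawner rewrites the same PID file and that you run it as a user allowed to probe the process.

```shell
#!/bin/sh
# Hypothetical check: after killing a spawner, wait until its PID file
# shows a different PID that belongs to a live process.
# Usage: wait_for_restart PID_FILE OLD_PID [TIMEOUT_SECONDS]
wait_for_restart() {
    file=$1
    old_pid=$2
    remaining=${3:-60}
    while [ "$remaining" -gt 0 ]; do
        # The file may briefly be missing while the service restarts.
        new_pid=$(head -n 1 "$file" 2>/dev/null || true)
        if [ -n "$new_pid" ] && [ "$new_pid" != "$old_pid" ] \
            && kill -0 "$new_pid" 2>/dev/null; then
            echo "$new_pid"
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}
```

A typical manual test would be: note the current PID, kill it, then call wait_for_restart with the spawner's PID file and the old PID, and see whether a new PID comes back before the timeout.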

 

5. Considerations for the SAS Connect Spawner

 

More good news. When you plan to implement high availability for your SAS Connect Spawners, there are not many changes to those customizations to take into account. The reason is that the ConnectSpawner.sh wrapper created by SAS closely follows the implementation of ObjectSpawner.sh:

 

eval "nohup $COMMAND $CMD_OPTIONS -sasSpawnerCn \"$SPWNNAME\" -xmlconfigfile $OMRCFG -logconfigloc $CONFIGDIR/logconfig.xml ${USERMODS}> $LOGSDIR/ObjectSpawner_console_${HOSTNAME}.log 2>&1 &"
         pid=$!
         echo $pid > $CONFIGDIR/$SERVER_PID_FILE_NAME

 

Thanks to this implementation, we can capture the PID with exactly the same method described above for the SAS Object Spawner.

 

A summary:

 

1. Copy one of the customized sample scripts that we created for the Object Spawners.

2. Customize it for the Connect Spawner:

name="arbitrary_name_of_your_sas_connect_service"
command="/path_to_sasconfig/Lev1/ConnectSpawner/ConnectSpawner.sh start"
 spwn_pid_filename=/opt/sas/comp/sasconfig/Lev1/ConnectSpawner/server.${thisHost}.pid

3. Test the script 

4. Create the HA service in SWO

5. Test the script with SWO GUI

 

That is mainly all we need. Once you have done it once, it is actually a fairly simple implementation!

 

6. Considerations for the SAS Web Infrastructure database

The webinfdsvrc.sh script for the SAS WIP database is another wrapper for the actual service, which means we can use a method similar to the earlier one, but with a couple of extra considerations:

 

A. The WIP database method for high availability / clustering / fail-over is Active-Passive. It should run on only one node at a time; otherwise we run the risk of corrupting our database.

 

B. The sample script for the HA service needs a couple of changes. The PID file is not exactly like the one for the Spawners, and we will also need an extra "sleep" command.
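On the PID file: the WIP data server keeps a PostgreSQL-style postmaster.pid, which holds several lines with the PID on the first one, so reading the whole file returns more than just the PID. A small sketch of extracting only that first line follows. first_line_pid is a hypothetical helper, and the sample file content is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: return only the first line (the PID) of a
# postmaster.pid-style file, with surrounding whitespace stripped.
first_line_pid() {
    head -n 1 "$1" | tr -d '[:space:]'
}

# Made-up sample content, for illustration only:
# PID, data directory, start time, port.
sample=$(mktemp)
printf '%s\n' 4242 /some/data/dir 1586170000 5432 > "$sample"
first_line_pid "$sample"      # prints 4242
rm -f "$sample"
```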

 

With those two considerations in mind, let me go into a bit more detail, following guidelines similar to those for the Spawners:

 

1. Copy one of the customized sample scripts that we created for the Object Spawners.

2. Customize it for the WIP database:

name="arbitrary_name_of_your_sas_wip_service"
command="/path_to_sasconfig/Lev1/WebInfrastructurePlatformDataServer/webinfdsvrc.sh start"

3. One more customization for the WIP database:

##**********************************************************
## Start the Service
##**********************************************************
[lines of code]
nohup $command > "$log_filename" 2>&1 &
# Modify to pick up the PID generated - Juan Sanchez
#pid=$!
#echo $pid > $pid_filename
sleep 1
spwn_pid_filename=/opt/sas/comp/sasconfig/Lev1/WebInfrastructurePlatformDataServer/data/postmaster.pid
# postmaster.pid holds several lines; the PID is on the first one
spwn_pid=`head -n 1 "$spwn_pid_filename"`
echo "$spwn_pid" > "$pid_filename"
pid=$spwn_pid   # keep $pid set for the message below
#
echo "${now} ${script}: Service ${name} (pid $pid) is started"

4. Make a good backup of your WIP database!

5. Test the script 

6. Create the HA service in SWO. Two considerations:

     a. Important: for Active-Passive clusters, set "Number of instances" to 1.

     b. Caveat for Azure and for any environment with a load balancer that does not allow connections from a node to itself:

  • By default, the SAS Deployment Backup and Recovery tool runs the backup of the WIP database from the first node where you installed SAS compute. For our SAS platform, this means that the scheduled SAS backups will fail for WIP if WIP runs on the first node.
  • To prevent this, in SWO we set "Host names" to list every grid node except the first one. SWO will then start the WIP database on any node but the first, and the SAS backup will run on the first node as usual. Of course, you can take any other approach that works for you. Please do not hesitate to share your ideas in the comments!

7. Test the script with SWO GUI
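If your grid has many nodes, the "every node except the first one" list for the Host names field can be produced with a tiny helper rather than typed by hand. This is just a convenience sketch: hosts_except_first is a hypothetical function, the host names in the usage example are invented, and it assumes you pass the nodes in installation order.

```shell
#!/bin/sh
# Hypothetical helper: print every host from the argument list except the first,
# one per line. Prints nothing when given zero or one host.
hosts_except_first() {
    if [ $# -eq 0 ]; then return 0; fi
    shift
    if [ $# -gt 0 ]; then printf '%s\n' "$@"; fi
}

# Example with invented host names:
#   hosts_except_first grid01 grid02 grid03
# lists grid02 and grid03, which is what goes into "Host names" for the WIP service.
```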

 

If you have more databases from other SAS solutions, you will be able to convert them into highly available services in no time with this approach!

 

Please note: SAS documents how to provide HA for WIP in https://support.sas.com/resources/papers/Managing-WIP-DataServerforHA.pdf . However, this guide does not follow that approach.

 

For those it might help, the reasoning behind this decision was: a) this implementation is easier and shorter; b) the documented method still has a single point of failure (SPOF) in the pgpool service; c) as it works in Master-Slave mode, the documented approach requires a WIP database per grid node, so either each node needs large enough local storage or you place the databases on the shared storage, which adds workload to it, reduces performance and takes up significant space.

 

Discussion

 

In my particular case, this was not quite enough for the first implementation of my first Object Spawner (I have six of them), but that was due to a typo I made in the username: a trailing space, which the SWO GUI did not detect.

 

I managed to resolve it with a bit of troubleshooting, which I will describe in my next article; thanks and kudos to @doug_sas for his support. I will link to it from this article once it is published.

 

Once I had implemented and validated the first one, I could enable and validate the rest of the Object Spawners in a handful of minutes. If you just copy your validated script and make a few modifications, it will make your life easy. E.g.:

 

 

cp /sas_application/sasdata/sasadmin/swo_scripts/objspwn_utf8_ha.sh /sas_application/sasdata/sasadmin/swo_scripts/objspwn_latin1_ha.sh

 

 

Then a few modifications to the script:

 

name="sas_obj_spwn_latin1_ha_svc"
command="/opt/sas/comp/sasconfig/Lev1/ObjectSpawnerLatin1/ObjectSpawner.sh start"

spwn_pid_filename=/sas_application/sasconfig/comp/config/Lev1/ObjectSpawnerLatin1/server.${thisHost}.pid

 

Of course, this change can be automated with a bit of prep work ... it is even easier if you create variables for the path and for the Object Spawner name as it appears in the file system.
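One way to do that prep work is to derive every spawner-specific value from two inputs, the configuration path and the spawner directory name. The function below is a sketch under my own assumptions: set_spawner_vars is a hypothetical name, the service naming pattern is invented, and it assumes the PID file sits in the same directory as ObjectSpawner.sh, which you should verify in your installation.

```shell
#!/bin/sh
# Hypothetical parameterization: derive all spawner-specific values from
# the configuration root and the spawner directory name.
# Usage: set_spawner_vars CONFIG_ROOT SPAWNER_DIR [HOST]
# Assumption: the PID file lives next to ObjectSpawner.sh.
set_spawner_vars() {
    config_root=$1                 # e.g. the Lev1 directory of your SAS configuration
    spawner_dir=$2                 # spawner directory name as on the file system
    this_host=${3:-$(hostname)}    # defaults to the local host name
    name="sas_${spawner_dir}_ha_svc"
    command="${config_root}/${spawner_dir}/ObjectSpawner.sh start"
    spwn_pid_filename="${config_root}/${spawner_dir}/server.${this_host}.pid"
}
```

With this at the top of the script, each copy only differs in the two arguments passed to set_spawner_vars, so creating a new spawner service becomes a one-line change.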

 

After that, creating the HA service in SWO is as easy as described in step 2 of this article.

 

As we have seen, we can use the same approach for the Object Spawners and the Connect Spawners, and a very similar one for the WIP database, with a couple of extra considerations.

 

As a closing remark, and as you could see above, it is important to treat the provided sample script as just a sample, an initial guide. When you want to implement HA for a service, you need to explore the architectural particularities of that service a little in order to integrate it with the sample script and SAS Workload Orchestrator. You want to check that the correct PID is recognized and that the start, stop, status and restart functions behave as expected. Therefore a bit of SAS knowledge and Linux scripting is required, not to mention a lot of curiosity.

 

With this guideline we have covered the most critical services of the SAS Compute tier, by using the SAS Workload Orchestrator, our new SAS Grid Manager: the Object Spawners, the Connect Spawners and the WIP database.

 

In addition to those, we would need to set up highly available services for the scheduling services, such as SAS Launcher and SAS Job Flow Scheduler. However, at the time of writing, HA is not supported for SWO's scheduling services. On the other hand, SAS R&D is currently working on it, and I will update this article as soon as I have news on this topic. For now, you can make use of any other scheduler tool you have at hand.

 

I hope this is useful for future implementations of highly available services with SAS Workload Orchestrator.

 

Please do not hesitate to contact me or share your comments below! I would love to learn from your experiences and implementations.

 

Related links, not SAS Communities related:

 

Next articles to come:

  • How to troubleshoot HA services with SAS Workload Orchestrator in Linux (SAS Grid Manager) - Pending Publishing
  • How to convert SAS Launcher and SAS Job Flow Scheduler into HA services - Pending Availability

 
