Hello everyone,
I am trying to enable High Availability services in the new grid, SAS Workload Orchestrator (SWO). For that, the services must handle start/stop/status/restart in a way that SWO can work with. The documentation proposes a "sample script" to wrap the HA-to-be services. As you will see below, it has little to do with SWO or the services per se; it is just bash scripting plus identifying and handling the PID.
The documentation: https://documentation.sas.com/?docsetId=gridref&docsetTarget=p02b7o2r85b0kon1rnsq2orm53f5.htm&docset...
However, as far as I can see, the sample script cannot identify and store the PID of the child process itself (the Object Spawner in this case); see the nohup line followed by pid=$! in start_service. When running the start action, I would say it is storing the PID of "nohup" instead, which of course makes the rest of the script fail miserably, for example when running it with the status parameter.
The bash scripting follows the standard/best-practice approach. I googled it and it does seem to be the general recommendation, but clearly something fails.
Can anyone help me identify the required changes to the script?
PS. I also tried without this sample script, using the sh script from the ObjectSpawner (or the WIP database) directly, but the results were worse, not better.
Thank you in advance!
#!/bin/sh
##************************************************************
##
## Example script to handle a service
##
##************************************************************
action=$1
name="sas_obj_spwn_utf8_ha_svc"
script=`basename "$0"`
now=`date +%Y-%m-%d@%H:%M:%S`
command="/opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh start"
thisHost=`hostname`
log_filename="/tmp/sas/swo/${name}.${thisHost}.log"
pid_filename="/tmp/sas/swo/${name}.${thisHost}.pid"

##************************************************************
##
## Define the functions to be used
##
##************************************************************

##**********************************************************
## Start the Service
##**********************************************************
start_service()
{
    if [ -f $pid_filename ]; then
        pid=`cat $pid_filename`
        kill -0 $pid > /dev/null 2>&1
        if [ $? -eq 0 ]; then
            echo "${now} ${script}: Service ${name} (pid $pid) is already running"
            exit 0
        fi
        rm $pid_filename
    fi
    nohup $command > $log_filename 2>&1 &
    pid=$!
    echo $pid > $pid_filename
    echo "${now} ${script}: Service ${name} (pid $pid) is started"
}

##**********************************************************
## Stop the Service
##**********************************************************
stop_service()
{
    if [ -f $pid_filename ]; then
        pid=`cat $pid_filename`
        kill $pid > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            echo "${now} ${script}: Service ${name} (pid $pid) could not be stopped"
        else
            echo "${now} ${script}: Service ${name} (pid $pid) has been stopped"
            rm $pid_filename
        fi
    else
        echo "${now} ${script}: Service ${name} is stopped"
        exit 1
    fi
}

##**********************************************************
## Get the Service's status
##
## status = 0, everything is OK
## status < 0, temp error, retry 5 times before restarting
## status > 0, error, try restarting
##**********************************************************
get_service_status()
{
    if [ -f $pid_filename ]; then
        pid=`cat $pid_filename`
        kill -0 $pid > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            echo "${now} ${script}: Service ${name} (pid $pid) is stopped"
            exit 1
        else
            echo "${now} ${script}: Service ${name} (pid $pid) is running"
            exit 0
        fi
    else
        echo "${now} ${script}: Service ${name} is assumed to be stopped"
        exit 1
    fi
}

##************************************************************
##
## Perform the requested action
##
##************************************************************
case $action in
    ##**********************************************************
    ## Start the Service
    ##**********************************************************
    start | -start)
        start_service
        ;;
    ##**********************************************************
    ## Stop the Service
    ##**********************************************************
    stop | -stop)
        stop_service
        ;;
    ##**********************************************************
    ## Get the service's status
    ##**********************************************************
    status | -status)
        get_service_status
        ;;
    ##**********************************************************
    ## Restart the service
    ##**********************************************************
    restart | -restart)
        echo "${now} ${script}: Service ${name} is being restarted"
        stop_service
        sleep 1
        start_service
        ;;
    ##**********************************************************
    ## Unknown option
    ##**********************************************************
    *)
        echo "Invalid option \"$1\""
        echo "Usage: $SCRIPT {-}{start|stop|status|restart}"
        exit 1
esac
exit 0
Hi @doug_sas , everyone,
An update and good news: I managed to get this working for one ObjectSpawner.
The issue in SWO was a typo, hard to spot unless you set up the daemon logs as you mentioned.
User=>sasinst < haManagerInit: Cannot authenticate user.
As this is really hard to see in the SWO GUI, I made the update through JSON, using Notepad++ to ensure no stray characters were included.
Regarding the logs, I really like the fact that the strings are delimited.
@doug_sas Regarding the GUI, I would suggest some improvements. In the front end, a neat JS validation check would help, along with some further information ("ADDED" without an error needs more description, IMHO). In the back end, a trim and a check that the user can be authenticated would prevent a lot of headaches and troubleshooting in the future. In the logs, "Cannot authenticate user" should definitely be an ERROR, not INFO.
If you could pass this on to the responsible team, that would be great. If you want, I could create an entry in the SASBallot ideas, here in the communities.
I will now implement the same for the rest of the Object Spawners and then for the WIP database, and I will post an update to keep the knowledge base current.
For now, a summary:
nohup $command > $log_filename 2>&1 &
# Modify to pick up the PID generated by ObjectSpawner.sh itself - Juan Sanchez
#pid=$!
#echo $pid > $pid_filename
sleep 1
spwn_pid_filename=/sas_application/sasconfig/comp/config/Lev1/ObjectSpawnerUTF8/server.${thisHost}.pid
spwn_pid=`cat $spwn_pid_filename`
echo $spwn_pid > $pid_filename
# echo "${now} ${script}: Service ${name} (pid $pid) is started"
}
<logger name="App.Grid.SGMG.Log.HA" additivity="false">
  <level value="trace"/>
  <appender-ref ref="LOG"/>
</logger>
Once all is done, roll back the changes, and repeat if needed for further troubleshooting.
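The fixed "sleep 1" in the start_service change above assumes the spawner has written its own PID file within one second. A slightly more defensive variant could poll for the file instead. The following is only a minimal sketch: the function name wait_for_pid_file, the retry count, and the stand-in "daemon" (a plain sleep) are all hypothetical and used purely for illustration; in the real script the file would be the spawner's server.${thisHost}.pid.

```shell
#!/bin/sh
# Hypothetical helper: wait until a PID file exists and names a live process.
# Usage: wait_for_pid_file <file> <max_tries>
# Prints the PID and returns 0 on success, returns 1 if it gives up.
wait_for_pid_file() {
    file=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        if [ -s "$file" ]; then
            found_pid=$(cat "$file")
            # kill -0 sends no signal; it only checks that the process exists
            if kill -0 "$found_pid" 2>/dev/null; then
                echo "$found_pid"
                return 0
            fi
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Demo: a stand-in "daemon" that writes its PID file after a short delay.
demo_dir=$(mktemp -d)
( sleep 1; sleep 30 > /dev/null 2>&1 & echo $! > "$demo_dir/server.pid" ) &

if spwn_pid=$(wait_for_pid_file "$demo_dir/server.pid" 5); then
    wait_status=0
    echo "daemon is up with pid $spwn_pid"
    kill "$spwn_pid"          # clean up the stand-in daemon
else
    wait_status=1
    echo "daemon did not come up in time"
fi
rm -rf "$demo_dir"
```

In the HA wrapper, the PID returned by such a helper would be written to $pid_filename, and a non-zero return could be turned into a failing exit code for the start action.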
Best regards,
Juan
The script is a sample to use for something that does not come with its own script, as ObjectSpawner does. It is not meant to replace the ObjectSpawner.sh script for HA purposes (in fact, if you compare it with some of SAS's scripts, it looks very similar).
If you specify the ObjectSpawner.sh script for an HA service in SWO, what happens?
Thank you @doug_sas for your interest and attention.
If the ObjectSpawner script is used instead, SWO cannot even start or stop the service, or recognize its status (it remains in ADDED status).
Did you have the chance to test this yourself? Does it work for you, with the sample script or with the ObjectSpawner script?
As a side note, something similar happens when this is done for other services, such as the WIP script, or any other command/service.
I think it would be interesting to get more specific advice from the vendor on how to achieve highly available services (SAS 9.4) in SWO's grid. At least for us, what is documented does not seem to work here.
And, in any case, as a sample script has been provided and documented, I think anyone would expect it to work within the boundaries of its documented purpose.
Are you running the script as the install user? The ObjectSpawner.sh script is only executable for the sas install user.
Does the service get scheduled to a daemon to be run? Are there SWO error messages when it tries to start the script?
Are you running the script as the install user? The ObjectSpawner.sh script is only executable for the sas install user.
Always!
Does the service get scheduled to a daemon to be run?
Not sure if I understand that, but I'll give it a shot: no, at this moment every service is started or stopped manually, as it is not fully stable.
Are there SWO error messages when it tries to start the script?
No error messages in the GUI (that is something that perhaps needs attention).
In the log, plenty of a) "HA service instance XXXX has been added", then multiple b) "failed to start on host YYYY", then multiple c) "could not run script specified in service, status=0x0", which does not help much either, I think.
I would recommend opening a tech support ticket so they can sort out the problem.
Hi Doug,
Yes, thanks for the advice. The ticket has been open for a week, although it is not getting much activity. Hopefully I can get an answer over there.
The reason I raised the question here is to share it more informally, but also with a broader spectrum of minds, to find a temporary workaround until the official/supported solution arrives.
A relatively quick workaround could easily come from fixing 2 simple lines of code! If $! can fetch the right PID, the problem is solved for now! I apparently just lack the in-depth bash scripting skills, and I am pretty sure we have peers with much better skills in this area.
nohup $command > $log_filename 2>&1 &
pid=$!
$! is the pid of the last command put into the background, which for the script's purpose should be "$command".
What is the $command you are executing, and does it put anything into the background too?
Hey Juan,
The script syntax is correct, check it out:
nik at b5 in ~ on master*
$ nohup sleep 1234 > my.log 2>&1 &
[3] 1885780
nik at b5 in ~ on master*
$ pid=$!; echo $pid
1885780
$ strings /proc/$pid/cmdline
sleep 1234
I'd make sure that the script you're calling is the target command or execs into it, rather than forking to another pid, like some of the JVM scripts do. That might be your issue.
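To make the forking case concrete, here is a minimal sketch of what $! captures when the launched script itself backgrounds the real process and exits. The wrapper.sh below is a hypothetical stand-in written only for this demonstration (it is not ObjectSpawner.sh, and nothing here is taken from SAS's scripts); it starts a sleep in the background, records that PID in a file, and returns immediately.

```shell
#!/bin/sh
# Hypothetical wrapper that behaves like a daemon launcher: it forks the
# real work into the background, records that PID, and exits right away.
workdir=$(mktemp -d)
cat > "$workdir/wrapper.sh" <<'EOF'
#!/bin/sh
sleep 30 &
echo $! > "$1"
EOF
chmod +x "$workdir/wrapper.sh"

# Same pattern as the sample script: nohup ... & followed by $!
nohup "$workdir/wrapper.sh" "$workdir/daemon.pid" > "$workdir/wrapper.log" 2>&1 &
wrapper_pid=$!
wait "$wrapper_pid"            # the wrapper finishes almost immediately
daemon_pid=$(cat "$workdir/daemon.pid")

echo "wrapper pid (what \$! returned): $wrapper_pid"
echo "daemon pid (from the pid file):  $daemon_pid"

# kill -0 only tests for existence; no signal is delivered
kill -0 "$wrapper_pid" 2>/dev/null && wrapper_state=running || wrapper_state=gone
kill -0 "$daemon_pid" 2>/dev/null && daemon_state=running || daemon_state=gone
echo "wrapper is $wrapper_state, daemon is $daemon_state"

# Clean up the stand-in daemon and the temporary files
kill "$daemon_pid"
rm -rf "$workdir"
```

The PID stored via $! belongs to the short-lived wrapper, so a later "kill -0" on it reports the service as stopped even though the forked process is still running. That is the behaviour a status check would see if the launched script forks rather than execs.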
Nik
Thank you @boemskats , @doug_sas.
@doug_sas look at the script or my description. It is the Object Spawner, although I also tried with the webinfdssvrc.sh script.
@boemskats well, if this is true, it would be one of the first answers that makes sense and gets close to the issue. It does bring me to the next question, though. Assuming that is what is going on, and the ObjectSpawner.sh and webinfdssvrc.sh processes are forking to new PIDs because they might be JVM based ... how can this be worked around?
The output:
$ nohup /opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh start > /tmp/sas/swo/sas_obj_spwn_utf8_ha_svc.GLQAUEQ1AP523.log 2>&1 &
[1] 105316
$ pid=$!; echo $!
105316
[1]+ Done nohup /opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh start > /tmp/sas/swo/sas_obj_spwn_utf8_ha_svc.GLQAUEQ1AP523.log 2>&1
$ strings /proc/$pid/cmdline
strings: '/proc/105316/cmdline': No such file
$ /opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh status
Spawner is started (pid 105338)
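Since the status action in the session above prints the real PID, one possible way to obtain it is to parse that line. This is only a sketch: the helper name pid_from_status is hypothetical, and it assumes the output keeps the "(pid N)" form shown above, which may not hold for other releases or other services.

```shell
#!/bin/sh
# Hypothetical helper: extract the PID from a status line such as
# "Spawner is started (pid 105338)". Prints nothing if no "(pid N)" is found.
pid_from_status() {
    printf '%s\n' "$1" | sed -n 's/.*(pid \([0-9][0-9]*\)).*/\1/p'
}

# Sample line copied from the session above. In real use this could be:
#   status_line=$(/opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh status)
status_line="Spawner is started (pid 105338)"
spwn_pid=$(pid_from_status "$status_line")
echo "$spwn_pid"
```

For the sample line this prints 105338. An empty result would mean the status line did not contain a PID, which a wrapper could treat as "not running".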
If your $command is to call the ObjectSpawner.sh script which itself forks a background process, that may indicate why the PID is wrong.
The object spawner is not JVM based. You should be able to just use the ObjectSpawner.sh script in the HA service. Hopefully tech support can find out why it is not working (which will require you to start the ObjectSpawner service on a daemon where logging has been set to trace, to get the needed information).
Thank you again @doug_sas , I hope the Tech Support team is able to come back to me soon.
I updated my previous post with the output of the commands proposed by @boemskats.
Hi @doug_sas / Doug, all,
I managed to fix the sample script itself; it can now properly handle the start, stop and status actions. However, SWO is still not capturing the return codes of the now-working sample script.
Is there a way to enable the required SWO logging only for HA service management, the triggering of scripts, and return code capture?
I mean, replacing logconfig.xml with logconfig.trace.xml will provide a lot of information, for sure, but that might be overkill and distracting, and I am not even sure it will provide the required information. I think it would be better to do this replacement with an alternative logconfig.trace.xml that has only the required options at the required level.
Thank you in advance,
Best regards,
Juan
Turn on the App.Grid.SGMG.Log.HA logger to trace. That will output everything related to HA service processing.
Good one, thanks! I will keep you posted with updates.