JuanS_OCS
Azurite | Level 17

Hello everyone,

 

I am trying to enable High Availability services in the new grid, SAS Workload Orchestrator (SWO). For that, the services must handle start/stop/status/restart in a way that SWO can work with. To this end, the documentation proposes a “sample script” to wrap the HA-to-be services. As you will see below, it has nothing really to do with SWO or the services per se; it is just shell scripting plus the identification and handling of the PID.

 

The documentation: https://documentation.sas.com/?docsetId=gridref&docsetTarget=p02b7o2r85b0kon1rnsq2orm53f5.htm&docset...

 

However, as far as I can see, the sample script cannot identify/store the PID of the child process itself (the Object Spawner in this case); see the "nohup $command ... &" and "pid=$!" lines in start_service. When running the start action, I would say it is storing the "nohup" PID, which of course makes the rest of the script fail miserably, for example when running it with the status parameter.

 

The bash scripting follows what seems to be the standard/best practice. I googled it and it does seem to be the general recommendation, but clearly something fails.

 

Can anyone help me to identify the required changes to the script?

 

PS. I also tried without this sample script, using the sh script from the ObjectSpawner (or the WIP database) directly, but the results were worse, not better.

 

Thank you in advance!

 

 

 

#!/bin/sh
##************************************************************
##
## Example script to handle a service
##
##************************************************************

action=$1

name="sas_obj_spwn_utf8_ha_svc"
script=`basename "$0"`
now=`date +%Y-%m-%d@%H:%M:%S`
command="/opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh start"
thisHost=`hostname`
log_filename="/tmp/sas/swo/${name}.${thisHost}.log"
pid_filename="/tmp/sas/swo/${name}.${thisHost}.pid"

##************************************************************
##
## Define the functions to be used
##
##************************************************************

##**********************************************************
## Start the Service
##**********************************************************
start_service()
{
  if [ -f $pid_filename ]; then
     pid=`cat $pid_filename`
     kill -0 $pid > /dev/null 2>&1
     if [ $? -eq 0 ]; then
        echo "${now} ${script}: Service ${name} (pid $pid) is already running"
        exit 0
     fi
     rm $pid_filename
  fi
  nohup $command > $log_filename 2>&1 &
  pid=$!
  echo $pid > $pid_filename
  echo "${now} ${script}: Service ${name} (pid $pid) is started"
}

##**********************************************************
## Stop the Service
##**********************************************************
stop_service()
{
  if [ -f $pid_filename ]; then
    pid=`cat $pid_filename`
    kill $pid > /dev/null 2>&1
    if [ $? -ne 0 ]; then
      echo "${now} ${script}: Service ${name} (pid $pid) could not be stopped"
    else
      echo "${now} ${script}: Service ${name} (pid $pid) has been stopped"
      rm $pid_filename
    fi
  else
    echo "${now} ${script}: Service ${name} is stopped"
    exit 1
  fi
}

##**********************************************************
## Get the Service's status
##
##   status = 0, everything is OK
##   status < 0, temp error, retry 5 times before restarting
##   status > 0, error, try restarting
##**********************************************************
get_service_status()
{
  if [ -f $pid_filename ]; then
    pid=`cat $pid_filename`
    kill -0 $pid > /dev/null 2>&1
    if [ $? -ne 0 ]; then
      echo "${now} ${script}: Service ${name} (pid $pid) is stopped"
      exit 1
    else
      echo "${now} ${script}: Service ${name} (pid $pid) is running"
      exit 0
    fi
  else
    echo "${now} ${script}: Service ${name} is assumed to be stopped"
    exit 1
  fi
}




##************************************************************
##
## Perform the requested action
##
##************************************************************
case $action in

  ##**********************************************************
  ## Start the Service
  ##**********************************************************
  start | -start)
    start_service
    ;;

  ##**********************************************************
  ## Stop the Service
  ##**********************************************************
  stop | -stop)
    stop_service
    ;;

  ##**********************************************************
  ## Get the service's status
  ##**********************************************************
  status | -status)
    get_service_status
    ;;

  ##**********************************************************
  ## Restart the service
  ##**********************************************************
  restart | -restart)
    echo "${now} ${script}: Service ${name} is being restarted"
    stop_service
    sleep 1
    start_service
    ;;

  ##**********************************************************
  ## Unknown option
  ##**********************************************************
  *)
    echo "Invalid option \"$1\""
    echo "Usage: ${script} {-}{start|stop|status|restart}"
    exit 1
esac

exit 0

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
JuanS_OCS
Azurite | Level 17

Hi @doug_sas , everyone,

 

an update and good news. I managed to get this working for one ObjectSpawner. 


The issue in SWO was a typo that is hard to spot unless you set up the logging in the daemons as you mentioned.

 

 

User=>sasinst <
haManagerInit: Cannot authenticate user.

As this is really hard to see in the SWO GUI, I updated the configuration through JSON with Notepad++, ensuring no stray characters are included.

 

 

Regarding the logs, I really like the fact that the strings are delimited.

 

@doug_sas Regarding the GUI, I would suggest some improvements. In the front end, a neat JavaScript check would help, as would some further information ("ADDED" without an error needs further description, IMHO). In the back end, a trim of the input and a validation of the user would prevent a lot of headaches and troubleshooting in the future. In the logs, "Cannot authenticate user" should definitely be an error, not INFO.

 

If you could pass this to the responsible team, that would be great. If you want, I could create an entry in the SASBallot ideas, here in the communities.

 

I will now implement the same for the rest of the Object Spawners, and then for the WIP database, and I will post an update here to keep the knowledge base current.

 

For now, a summary:

 

  • Implement the provided sample script with a custom change so that it captures the PID of the Object Spawner. Beware: the script must never write to the PID file generated by the Object Spawner itself. It only needs to read that file once, in order to write the value into its own PID file as the sample script proposes. The tail of start_service then becomes:
nohup $command > $log_filename 2>&1 &
  # Modify to pick up the PID generated by ObjectSpawner.sh itself - Juan Sanchez
  #pid=$!
  #echo $pid > $pid_filename
  sleep 1
  spwn_pid_filename=/sas_application/sasconfig/comp/config/Lev1/ObjectSpawnerUTF8/server.${thisHost}.pid
  spwn_pid=`cat $spwn_pid_filename`
  echo $spwn_pid > $pid_filename
  #
  echo "${now} ${script}: Service ${name} (pid $spwn_pid) is started"
}

 

  • Test from the CLI that the sample script works OK for stop, start, status and restart. If the script does not work there, it won't work from SWO either, of course.
  • Configure SWO according to whether the HA/failover will be Active-Active (such as Object Spawners) or Active-Passive (such as the WIP database).
    • For the first case, "Number of instances" must be the number of your Grid nodes (or a lower value in case you don't want/need to enforce all the nodes).
    • For the second case, a value of 1.
  • Once the HA service is configured, it is highly advisable to double-check all the values, as the JavaScript validation in the GUI is limited at the moment.
  • In case something behaves unexpectedly (troubleshooting), a deeper understanding of the SWO mechanics is needed. You can get this with help from SAS Technical Support, or by yourself with the tips below:
    • (Optional) Disable the SWO HA service created
    • Stop the daemons (sgmg.sh) on every node: /path/config/Lev1/Grid/sgmg.sh stop
    • (Optional) Preferably, clean/archive the current SWO logs
    • Backup logconfig.trace.xml and logconfig.xml
    • Add a line in logconfig.trace.xml under "Grid Debug Loggers" block
    • <logger name="App.Grid.SGMG.Log.HA"        additivity="false"> <level value="trace"/> <appender-ref ref="LOG"/></logger>
    • Overwrite logconfig.xml with logconfig.trace.xml
    • Start SWO: /path/config/Lev1/Grid/sgmg.sh start
    • Enable the HA service and wait for a moment
    • From the logs, look for entries matching "App.Grid.SGMG.Log.HA". 
      • Example: User=>sasinst < and haManagerInit: Cannot authenticate user

Once all is done, roll back the changes, and repeat if needed for further troubleshooting.
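Regarding the first item of the summary: the fixed "sleep 1" assumes the Object Spawner has written its PID file within one second. A minimal sketch of a more tolerant variant is shown below, assuming it is acceptable to poll for a few seconds. The helper name wait_for_pid_file and the 10-second default are hypothetical and not part of the SAS sample script.

```shell
#!/bin/sh
# Hypothetical helper (not part of the SAS sample script): wait until a PID
# file exists and names a live process, or give up after a timeout.
#   $1 = PID file to watch
#   $2 = timeout in seconds (assumed default: 10)
wait_for_pid_file()
{
  wf_file=$1
  wf_timeout=${2:-10}
  wf_waited=0
  while [ "$wf_waited" -lt "$wf_timeout" ]; do
    if [ -s "$wf_file" ]; then
      wf_pid=`cat "$wf_file"`
      # kill -0 only tests whether the process exists; it sends no signal
      if kill -0 "$wf_pid" > /dev/null 2>&1; then
        echo "$wf_pid"
        return 0
      fi
    fi
    sleep 1
    wf_waited=$((wf_waited + 1))
  done
  return 1
}

# Possible use inside start_service, replacing "sleep 1" and the plain cat:
#   if spwn_pid=`wait_for_pid_file "$spwn_pid_filename" 10`; then
#     echo $spwn_pid > $pid_filename
#   else
#     echo "${now} ${script}: Service ${name} did not report a pid"
#     exit 1
#   fi
```

Because the helper also checks that the process is alive, a stale PID file left over from an earlier run is ignored until the spawner overwrites it. It cannot tell the spawner apart from an unrelated process that happens to have reused the same PID.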

 

Best regards,

Juan

 

 

 

 

 


24 REPLIES 24
doug_sas
SAS Employee

The script is a sample to use for something that does not come with its own script like ObjectSpawner does. It is not meant to replace the ObjectSpawner.sh script for HA purposes (in fact if you compare some of SAS's scripts, it looks very similar).

 

If you specify the ObjectSpawner.sh script to a HA service in SWO, what happens?

JuanS_OCS
Azurite | Level 17

Thank you @doug_sas for your interest and attention.

 

If the ObjectSpawner script is used instead, SWO cannot even start or stop the service, or recognize its status (remains in ADDED status).

 

Did you have the chance to test this yourself? Does it work for you, with the sample script or the ObjectSpawner script?

 

As a side note, something similar happens when this is done for other services, such as the WIP script, or any other command/service.

 

I think it would be interesting to get more specific advice from the vendor about how to achieve highly available services (SAS 9.4) in SWO's grid. At least for us, what is documented does not seem to work here.

 

And, in any case, as a sample script has been provided and documented, I think anyone would expect it to work within the boundaries of the documented purpose.

doug_sas
SAS Employee

Are you running the script as the install user? The ObjectSpawner.sh script is only executable for the sas install user.

 

Does the service get scheduled to a daemon to be run? Are there SWO error messages when it tries to start the script?

JuanS_OCS
Azurite | Level 17

 

Are you running the script as the install user? The ObjectSpawner.sh script is only executable for the sas install user.

Always!

Does the service get scheduled to a daemon to be run? 

Not sure if I understand that, but I'll give it a shot: no, at this moment every service is started or stopped manually, as it is not fully stable.

Are there SWO error messages when it tries to start the script?

No error messages in the GUI (that is something that perhaps would need attention)

In the log, plenty of a) "HA service instance XXXX has been added", then multiple b) "failed to start on host YYYY", then multiple c) "could not run script specified in service, status=0x0", which does not help much either, I think.

 

 

doug_sas
SAS Employee

I would recommend opening a tech support ticket so they can sort out the problem.

JuanS_OCS
Azurite | Level 17

Hi Doug,

 

yes, thanks for the advice. A ticket has been open for a week, although it is not getting much activity. Hopefully I can get an answer over there.

 

The reason I raised the question here is to share it more informally, and with a broader spectrum of minds, in order to find a temporary workaround until the official/supported solution comes into play.

A relatively quick workaround could come from fixing 2 simple lines of code! If $! could fetch the right PID, the problem would be solved for now! I apparently lack those in-depth bash scripting skills, and I am pretty sure we have peers with much better skills in this area.

 

  nohup $command > $log_filename 2>&1 &
  pid=$!

 

 

doug_sas
SAS Employee

$! is the pid of the last command put into the background, which for the script's purpose should be "$command".

 

What is the $command you are executing and does it put anything into the background too?

boemskats
Lapis Lazuli | Level 10

Hey Juan,

 

The script syntax is correct, check it out:

 

nik at b5 in ~ on master*
$ nohup sleep 1234 > my.log 2>&1 &
[3] 1885780
nik at b5 in ~ on master*
$ pid=$!; echo $pid
1885780
$ strings /proc/$pid/cmdline
sleep
1234

I'd make sure that the script you're calling is the target command or execs into it, rather than forking to another pid, like some of the JVM scripts do. That might be your issue.
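To illustrate the forking case, here is a minimal sketch. It uses a hypothetical wrapper, fake_service.sh, created in a temporary directory as a stand-in for a start script that launches its worker in the background and then exits. All names and paths are made up for the demonstration.

```shell
#!/bin/sh
# Demonstration only: fake_service.sh is a hypothetical stand-in for a start
# script that forks its real worker into the background and then exits.
workdir=`mktemp -d`
cat > "$workdir/fake_service.sh" <<'EOF'
#!/bin/sh
# Launch the "daemon" in the background, record ITS pid, then return.
sleep 30 > /dev/null 2>&1 &
echo $! > "$1"
EOF

nohup sh "$workdir/fake_service.sh" "$workdir/daemon.pid" > "$workdir/out.log" 2>&1 &
wrapper_pid=$!
wait $wrapper_pid                 # the wrapper finishes almost immediately
daemon_pid=`cat "$workdir/daemon.pid"`

echo "pid from \$!      : $wrapper_pid"
echo "pid of the daemon: $daemon_pid"

# The wrapper is gone, so kill -0 fails for it; the daemon is still alive.
if kill -0 $wrapper_pid 2>/dev/null; then wrapper_alive=yes; else wrapper_alive=no; fi
if kill -0 $daemon_pid 2>/dev/null; then daemon_alive=yes; else daemon_alive=no; fi
echo "wrapper alive: $wrapper_alive, daemon alive: $daemon_alive"

kill $daemon_pid                  # clean up the demo process
rm -rf "$workdir"
```

In this sketch the PID captured with $! belongs to the short-lived wrapper, so a later "kill -0" against it reports the service as stopped even though the worker is still running.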

 

Nik

JuanS_OCS
Azurite | Level 17

Thank you @boemskats , @doug_sas.

 

@doug_sas look at the script or my description. It is the Object Spawner. Although I tried as well with the webinfdssvrc.sh script.

 

@boemskats well, if this is true, it would be one of the first answers that makes sense and gets close to the issue. It does bring me to the next question, though. Assuming that is what is going on, and the ObjectSpawner.sh and webinfdssvrc.sh processes are forking PIDs because they might be JVM based ... how can I work around this?

 

The output:

 

$ nohup /opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh start > /tmp/sas/swo/sas_obj_spwn_utf8_ha_svc.GLQAUEQ1AP523.log 2>&1 &
[1] 105316

$ pid=$!; echo $!
105316
[1]+  Done                    nohup /opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh start > /tmp/sas/swo/sas_obj_spwn_utf8_ha_svc.GLQAUEQ1AP523.log 2>&1

$ strings /proc/$pid/cmdline
strings: '/proc/105316/cmdline': No such file

$ /opt/sas/comp/sasconfig/Lev1/ObjectSpawnerUTF8/ObjectSpawner.sh status
Spawner is started (pid 105338)

 

 

doug_sas
SAS Employee

If your $command is to call the ObjectSpawner.sh script which itself forks a background process, that may indicate why the PID is wrong.

 

The object spawner is not JVM based. You should be able to just use the ObjectSpawner.sh script in the HA service. Hopefully tech support can find out why it is not working (which will require you to start the ObjectSpawner service on a daemon where logging has been set to trace, to get the needed information).

JuanS_OCS
Azurite | Level 17

Thank you again @doug_sas , I hope the Tech Support team is able to come back to me soon.

 

I updated my previous post with the output of the commands as proposed by @boemskats 

JuanS_OCS
Azurite | Level 17

Hi @doug_sas / Doug, all,

 

I managed to fix the sample script itself; it can now properly handle the start, stop and status actions. However, SWO is still not capturing the return codes of the now-working sample script.
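For reference, the return code of each action can be checked from the CLI with a small helper along these lines. This is only a sketch; the name run_action is hypothetical, and the wrapper path passed to it is whatever script is configured for the HA service.

```shell
#!/bin/sh
# Hypothetical helper: run a service wrapper with one action and report the
# exit code it returns.
#   $1 = wrapper script (or any command name), $2 = action
run_action()
{
  "$1" "$2"
  rc=$?
  echo "action=$2 exit_code=$rc"
  return $rc
}

# Example (the path is a placeholder, not a real location):
#   run_action /path/to/service_wrapper.sh status
```

With the sample script, status is expected to report 0 while the service is running and 1 when it is stopped.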

 

Do you have a way to enable the required logging in SWO only for the HA services management, the triggering of scripts and the RC capture?

I mean, replacing logconfig.xml with logconfig.trace.xml will provide a lot of information, for sure, but that might be overkill and distracting, and I am not even sure it will provide the required information. I think it would be better to do this replacement with an alternative logconfig.trace.xml that has only the required options at the required level.

 

Thank you in advance,

Best regards,

Juan

 

 

doug_sas
SAS Employee

Turn on the App.Grid.SGMG.Log.HA logger to trace. That will output everything related to HA service processing.
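Once that logger is active, the HA-related lines can be pulled out of a daemon log with a plain grep. The sketch below assumes that the logger name appears in each log line, which depends on the configured log layout. The helper name show_ha_log_entries is hypothetical, and the log file location must be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: print only the HA-related lines of an SWO daemon log,
# prefixed with their line numbers.
#   $1 = path to the log file (assumption: the logger name is written to
#        every line by the log layout in use)
show_ha_log_entries()
{
  grep -n -F "App.Grid.SGMG.Log.HA" "$1"
}

# Example (the path is a placeholder, not a real location):
#   show_ha_log_entries /path/to/swo_daemon.log
```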

JuanS_OCS
Azurite | Level 17

Good one, thanks! I will keep you posted with updates.
