MariaD
Barite | Level 11

Hi Folks,

 

We have SAS Grid installed on Linux. After we restarted our server, we found a few jobs (listed with the bjobs command) whose submit date is earlier than the restart time. How is this possible? 

 

Regards,

rbetancourt
Obsidian | Level 7

I recommend making sure all of the hot fixes are up to date. I recently deployed a 9.4M5 Grid cluster and recall a hotfix that made jobs executed with SASGSUB report the correct job start time to LSF.

MariaD
Barite | Level 11

Hi @rbetancourt 

 

Thanks, do you know the name/link of the hotfix?

 

Regards, 

rbetancourt
Obsidian | Level 7

Hi @MariaD,

 

In our case, it was:

 

http://support.sas.com/kb/61/718.html

 

HTH.

 

Best regards,

Randy

AnandVyas
Ammonite | Level 13
Hi @MariaD,

Did you restart the server or only the SAS services? Were the LSF services also restarted? What is the state of these jobs? Were they by any chance in PEND state during the restart?
MariaD
Barite | Level 11

Hi @AnandVyas ,

 

We restarted the entire server, not only the services. The jobs are in RUN state. 

 

Regards, 

doug_sas
SAS Employee

Were the jobs submitted before or after the server was restarted? If the jobs were submitted after the restart but show a submit time before it, something is wrong. If they were submitted before the restart, everything is normal.

Jobs can be submitted, LSF restarted, and the jobs will still show the time they were submitted, because LSF logs 'events' (such as job submission, start, and termination) and reads them back when it starts up or when the master host changes.
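To illustrate why the submit time survives a restart, here is a toy replay of an event log in Python. The event format and the names `Event`, `Job` and `replay` are hypothetical simplifications made up for this sketch; LSF's actual event log (lsb.events) has its own format, which this code does not parse, and the timestamps are invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

# Hypothetical, simplified event record. LSF's real event log (lsb.events)
# has its own format, which this sketch does NOT parse.
@dataclass
class Event:
    kind: str       # "SUBMIT", "START" or "FINISH"
    job_id: int
    time: datetime

@dataclass
class Job:
    job_id: int
    submit_time: datetime
    state: str = "PEND"
    start_time: Optional[datetime] = None

def replay(events: List[Event]) -> Dict[int, Job]:
    """Rebuild job state from logged events, the way a restarted master would."""
    jobs: Dict[int, Job] = {}
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "SUBMIT":
            # The submit time is taken from the log, not from the moment of
            # replay, so a job submitted before a restart keeps its timestamp.
            jobs[ev.job_id] = Job(ev.job_id, submit_time=ev.time)
        elif ev.kind == "START":
            jobs[ev.job_id].state = "RUN"
            jobs[ev.job_id].start_time = ev.time
        elif ev.kind == "FINISH":
            jobs[ev.job_id].state = "DONE"
    return jobs

if __name__ == "__main__":
    # Invented timestamps: job 101 was submitted and started before a restart at 09:00.
    restart = datetime(2018, 5, 2, 9, 0)
    log = [
        Event("SUBMIT", 101, datetime(2018, 5, 1, 22, 15)),
        Event("START", 101, datetime(2018, 5, 1, 22, 16)),
    ]
    job = replay(log)[101]
    print(job.state, job.submit_time < restart)  # prints: RUN True
```

In this model, a job in RUN state whose submit date predates the restart is the expected outcome of replaying the log, not a sign of corruption.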

MariaD
Barite | Level 11

Hi @doug_sas 

 

The jobs were submitted before we restarted the server. One more question: what is the best way to kill a job without leaving zombie (defunct) processes or jobs? Sometimes when we use bkill, the job is not killed correctly and becomes a defunct or zombie process. It happens even when we use bkill -9. Is there any way to prevent it?

 

Regards, 

doug_sas
SAS Employee

If you just want to remove them from LSF, you can use 'bkill -r <jobID>'. 

If they are removed from LSF but the OS has not killed them, the processes probably have not handled the SIGTERM/SIGKILL signals correctly.
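Some background that may help: on Linux, a process listed as `<defunct>` (a zombie) has already exited. Its entry stays in the process table until its parent collects the exit status with wait(), and SIGKILL cannot be caught at all. So the part a job's own code can control is how it reacts to SIGTERM and whether it reaps its children. Below is a minimal sketch of a hypothetical Python job wrapper (`run_child` is an illustrative name, not part of LSF or SAS Grid) that forwards SIGTERM to the child it launched and always waits for it.

```python
import signal
import subprocess

def run_child(cmd):
    """Run cmd as a child, forward SIGTERM to it, and always reap it.

    Hypothetical wrapper for illustration; not part of LSF or SAS Grid.
    """
    child = subprocess.Popen(cmd)

    def on_term(signum, frame):
        # Pass the termination request on to the child instead of exiting
        # and leaving it behind.
        child.terminate()

    previous = signal.signal(signal.SIGTERM, on_term)
    try:
        # wait() reaps the child: it collects the exit status so the kernel
        # can drop the process-table entry (no <defunct> left behind).
        # After the handler runs, Python 3.5+ resumes the interrupted wait.
        return child.wait()
    finally:
        # Restore whatever SIGTERM handling was in place before.
        signal.signal(signal.SIGTERM, previous)
```

For example, `run_child(["sas", "-sysin", "job.sas"])` (an illustrative command line) returns the child's exit code, or a negative signal number if the child was killed by a signal. This only helps when the wrapper itself receives SIGTERM; if a child ignores SIGTERM the wait blocks, and a parent that never calls wait() will still leave zombies until it exits and init reaps the orphans.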

 

You could contact SAS Technical Support to see whether IBM support can explain why it is not working.
