John_Wick
Obsidian | Level 7

Hello, experts!

 

I have a problem executing scheduled jobs from Management Console using LSF.

 

LSF executes a .sh script, which is a wrapper around the sas command. The required deployed stp (deployed job) and the sas system options are passed to the .sh script as parameters.

 

The sas command inside .sh script looks as follows:

/sas/SAS94/SASFoundation/9.4/sas -xcmd -noterminal -nosyntaxcheck -autoexec ./autoexec.sas -sysin ./scripts/$SCRIPT.sas -nolog -noprint -altlog $LOG_FILE

 

When the command crashes, nothing is written to the altlog. When we redirect the standard error stream to a file, we get the following error:

ERROR: Recursive Segmentation Violations - system exiting 
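Because the crash happens before anything reaches the altlog, one way to keep a trace is to capture standard error and the exit status inside the wrapper itself. The following is only a minimal sketch under that assumption: the function name run_with_capture and the .stderr file suffix are hypothetical and are not taken from the original script.

```shell
#!/bin/sh
# Hypothetical helper (not part of SAS or LSF): run a command, store its
# stderr in a file, append the exit status to that file, and return the
# same status so the scheduler still sees the real exit code.
run_with_capture() {
    err_file=$1
    shift
    "$@" 2>"$err_file"
    rc=$?
    printf 'exit status: %d\n' "$rc" >>"$err_file"
    return "$rc"
}

# Possible usage inside the wrapper (left as a comment so the sketch runs
# without SAS installed):
#   run_with_capture "$LOG_FILE.stderr" \
#       /sas/SAS94/SASFoundation/9.4/sas -sysin "./scripts/$SCRIPT.sas" -altlog "$LOG_FILE"
#   exit $?
```

Returning the original status matters here, since LSF decides what to do with a job based on the exit code it receives from the script.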

 

This error appears every day across multiple jobs, and it is not tied to any particular job: a job that runs successfully today may fail tomorrow.

Most of the jobs run successfully, but some do not. When a job fails to start, restarting it manually helps.

 

There are no problems with RAM utilization on the server.
Could you please tell me what this problem may be related to?

1 ACCEPTED SOLUTION

Accepted Solutions
John_Wick
Obsidian | Level 7

The cause of this error could not be found, but we managed to apply the following workaround:

  1. Using bhist -all, determine the exit status with which the jobs are failing in LSF (in our case, Exit code=102; with SIGSEGV, Exit code=139 is also possible).
  2. Identify the queue in which the jobs are executed (in our case, QUEUE_NAME=normal; the queue name is also shown when using bhist -all).
  3. In the file
    <LSF-root-directory>/conf/lsbatch/sas_cluster/configdir/lsb.queues
    add the REQUEUE_EXIT_VALUES and MAX_JOB_REQUEUE parameters:
    ...
    Begin Queue
    QUEUE_NAME=normal
    ...
    REQUEUE_EXIT_VALUES=139 102
    MAX_JOB_REQUEUE=3
    ...
    End Queue
    ...
  4. Save the changes.

This allows LSF to requeue jobs that exit with code 139 or 102. MAX_JOB_REQUEUE sets the number of requeue attempts; in our case it is 3.
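For reference, an exit code above 128 conventionally means the process was terminated by a signal whose number is the code minus 128, which is why SIGSEGV (signal 11) shows up as 139. A small sketch of that arithmetic follows; the function name describe_exit_code is hypothetical and is not an LSF or SAS command, and the "likely" wording is deliberate, since a program can also return such a value on its own.

```shell
#!/bin/sh
# Hypothetical helper: interpret an exit code using the common shell
# convention that values above 128 mean "terminated by signal (code - 128)".
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ]; then
        echo "likely terminated by signal $((code - 128))"
    else
        echo "exited with status $code"
    fi
}

describe_exit_code 139   # prints "likely terminated by signal 11"
describe_exit_code 102   # prints "exited with status 102"
```

Under this convention, 102 is an ordinary exit status rather than a signal, so the two values listed in REQUEUE_EXIT_VALUES may correspond to different failure paths.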


6 REPLIES 6
doug_sas
SAS Employee

I would try to find the similarities in the SEGVs.

  • If you have multiple machines, do the SEGVs happen on a specific machine?
  • Does it happen for a specific user?
  • Does it happen for a specific SAS program?
  • Does it happen during a specific time of the day?
John_Wick
Obsidian | Level 7
1) Yes, we have multiple machines. SEGVs happen on a specific machine
2) This happens for user "lsfadmin", because the scheduled job is running under that user
3) No, it may happen for different sas programs
4) It most often occurs within a particular time frame, although that time frame is fairly wide
doug_sas
SAS Employee

If SEGVs occur on one specific machine, my guess would be that something is out of sync due to an incomplete or failed hotfix installation. Assuming all your machines are at the same version/maintenance level with the same hotfixes installed, compare the files in the <SASROOT>/SASFoundation/9.4 directory on a working machine to the ones on the failing machine to see if there are differences, specifically in shared library files: *.so (UNIX/Linux) or *.dll (Windows).

 

Also, the SEGVs may occur in multiple SAS programs, but only when they execute something similar (a specific PROC or DATA step function). If you can find out what was executing when the SEGV occurred, that may help narrow it down.
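The shared-library comparison suggested above could be sketched roughly as follows for UNIX/Linux. This is only an illustration: the function name lib_manifest and the manifest file names are hypothetical, and cksum is used simply because it is available on any POSIX system (a stronger hash tool could be substituted).

```shell
#!/bin/sh
# Hypothetical helper: print a checksum line for every *.so file under a
# directory, with paths relative to that directory and sorted by path,
# so manifests produced on two machines can be compared with diff.
lib_manifest() {
    (cd "$1" && find . -type f -name '*.so' -exec cksum {} + | sort -k 3)
}

# Possible usage (left as comments; <SASROOT> stays a placeholder):
#   lib_manifest <SASROOT>/SASFoundation/9.4 > manifest_$(hostname).txt
#   # run on both machines, copy the files side by side, then:
#   diff manifest_goodhost.txt manifest_badhost.txt
```

Any line reported by diff points at a library whose content or size differs between the working and the failing machine.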

John_Wick
Obsidian | Level 7
The problem is that the error occurs at the stage of launching the sas command itself, the one given in the problem description.
SASKiwi
PROC Star

I'd suggest opening a track with SAS Tech Support for this. Given that it is machine-specific, it could be a hardware-related problem. Is it a physical or virtual server? Hardware issues are more likely with physical servers. More diagnostics may be needed to identify the problem.

