Solved: Recursive Segmentation Violations - system exiting error while executing sas command from LSF

John_Wick · Posted 10-26-2023 06:29 AM

Hello, experts!

I have a problem executing a scheduled job from the Management Console using LSF.

LSF executes a .sh script, which is a wrapper around the sas command. The required deployed stp (deployed job) and SAS system options are passed to the .sh script as parameters.

The sas command inside .sh script looks as follows:

/sas/SAS94/SASFoundation/9.4/sas -xcmd -noterminal -nosyntaxcheck -autoexec ./autoexec.sas -sysin ./scripts/$SCRIPT.sas -nolog -noprint -altlog $LOG_FILE

When the command crashes, nothing is written to the altlog. When we redirect the standard error stream to a file, we get the following error:

ERROR: Recursive Segmentation Violations - system exiting

This error appears every day with multiple jobs. The problem is that it is not tied to any particular job: a job can run successfully today and fail with this error tomorrow.

Most of the jobs run successfully, but some do not. When a job fails to start, restarting it manually helps.

There are no problems with RAM utilization on the server.
Could you please tell me what this problem may be related to?
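
Since the altlog stays empty when the crash happens, the redirected standard error stream and the exit status are the only evidence left. Below is a minimal sketch of how the wrapper could capture both, assuming a POSIX shell. The function name run_logged and the .stderr file suffix are hypothetical and not taken from the original script.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original wrapper script).
# Usage: run_logged <stderr-file> <command> [args...]
# Runs the command, stores its stderr in <stderr-file>, appends the exit
# status to the same file, and returns that status to the caller.
run_logged() {
    err_file=$1
    shift
    status=0
    "$@" 2>"$err_file" || status=$?
    # By shell convention, a status above 128 means the process was
    # terminated by signal (status - 128); 139 corresponds to signal 11, SIGSEGV.
    if [ "$status" -gt 128 ]; then
        echo "exit status $status (signal $((status - 128)))" >>"$err_file"
    else
        echo "exit status $status" >>"$err_file"
    fi
    return "$status"
}

# Example call with the command from the post:
# run_logged "$LOG_FILE.stderr" /sas/SAS94/SASFoundation/9.4/sas -xcmd -noterminal \
#     -nosyntaxcheck -autoexec ./autoexec.sas -sysin "./scripts/$SCRIPT.sas" \
#     -nolog -noprint -altlog "$LOG_FILE"
```

Returning the original status matters here, because LSF decides what to do with the job based on the exit code of the script it launched.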

John_Wick · Posted 10-27-2023 08:47 AM

We could not find the cause of this error, but the following workaround helped:

1) Using bhist -all, determine the exit status in LSF with which the jobs are failing (in our case, Exit code=102; with SIGSEGV, Exit code=139 is also possible).
2) Determine the queue in which the jobs are executed (in our case, QUEUE_NAME=normal; the queue name is also shown in the bhist -all output).
3) In the file
<LSF-root-directory>/conf/lsbatch/sas_cluster/configdir/lsb.queues
add the REQUEUE_EXIT_VALUES and MAX_JOB_REQUEUE parameters:
...
Begin Queue
QUEUE_NAME=normal
...
REQUEUE_EXIT_VALUES=139 102
MAX_JOB_REQUEUE=3
...
End Queue
...
4) Save the changes.

This allows LSF to requeue jobs that exit with code 139 or 102. MAX_JOB_REQUEUE sets the maximum number of requeues; in our case it is 3.
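
To make the effect of these two parameters concrete, here is a rough shell sketch of the same policy: rerun a command when it exits with one of the listed codes, up to a fixed number of reruns. It only illustrates the behavior and is not how LSF implements requeueing; the function name run_with_requeue is hypothetical. Note also that LSF normally needs its batch configuration reloaded (typically with badmin reconfig) before edits to lsb.queues take effect, a step the post does not mention.

```shell
#!/bin/sh
# Illustration only: mimics "requeue on exit 139 or 102, at most 3 times".
# Usage: run_with_requeue <max-requeue> "<space-separated exit codes>" <command> [args...]
run_with_requeue() {
    max_requeue=$1
    requeue_codes=$2
    shift 2
    attempt=0
    while :; do
        status=0
        "$@" || status=$?
        # Return immediately unless the status is one of the requeue codes.
        case " $requeue_codes " in
            *" $status "*) ;;
            *) return "$status" ;;
        esac
        # Give up once the requeue budget is used up.
        if [ "$attempt" -ge "$max_requeue" ]; then
            return "$status"
        fi
        attempt=$((attempt + 1))
        echo "exit status $status, requeue $attempt of $max_requeue" >&2
    done
}

# Example, mirroring REQUEUE_EXIT_VALUES=139 102 and MAX_JOB_REQUEUE=3
# (job_wrapper.sh is a placeholder name):
# run_with_requeue 3 "139 102" ./job_wrapper.sh
```

With a budget of 3, a job that keeps failing runs four times in total: the first attempt plus three reruns.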


doug_sas · Posted 10-26-2023 08:38 AM

I would try to find the similarities in the SEGVs.

If you have multiple machines, do the SEGVs happen on a specific machine?
Does it happen for a specific user?
Does it happen for a specific SAS program?
Does it happen during a specific time of the day?

John_Wick · Posted 10-26-2023 09:50 AM

1) Yes, we have multiple machines. The SEGVs happen on one specific machine.
2) It happens for the user "lsfadmin", because the scheduled jobs run under that user.
3) No, it can happen with different SAS programs.
4) It occurs most often within a specific time frame, but that time frame is fairly wide.

doug_sas · Posted 10-26-2023 10:16 AM

If SEGVs occur on one specific machine, my guess would be that something is out of sync due to an incomplete or failed hotfix installation. Assuming all your machines are at the same version/maintenance level with the same hotfixes installed, compare the files in the <SASROOT>/SASFoundation/9.4 directory on a working machine to the ones on the failing machine to see if there are differences, specifically in shared library files *.so (UNIX/Linux) or *.dll (Windows).
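
One possible way to do that comparison on Linux is to build a checksum manifest of the shared libraries on each machine and diff the two manifests. This is a sketch that assumes GNU coreutils (sha256sum) is available; the function name make_manifest, the /tmp file names, and the host names in the comments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "checksum  ./relative/path" for every *.so file
# under the given directory, sorted by path so manifests from two machines
# line up when diffed.
# Usage: make_manifest <root-dir>
make_manifest() {
    ( cd "$1" && find . -type f -name '*.so' -exec sha256sum {} + | sort -k 2 )
}

# On each machine (SAS path as in the original post):
#   make_manifest /sas/SAS94/SASFoundation/9.4 > "/tmp/sas_so_$(hostname).txt"
# Then copy both manifests to one machine and compare them:
#   diff /tmp/sas_so_goodhost.txt /tmp/sas_so_badhost.txt
```

Any line that diff reports is a library that is missing on one side or whose contents differ, which would be a starting point for checking the hotfix history of the failing machine.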

Also, the SEGVs may occur in multiple SAS programs, but only when executing something similar (a specific PROC or data step function). If you can find out what it was executing when SEGV occurred, maybe that would help narrow it down.

John_Wick · Posted 10-26-2023 11:44 AM

The problem is that the error occurs as soon as the sas command itself (shown in the problem description) is run.

SASKiwi · Posted 10-26-2023 04:08 PM

I'd suggest opening a track with SAS Tech Support for this. Given that the problem is machine-specific, it could be hardware-related. Is it a physical or virtual server? Hardware issues are more likely with physical servers. It's possible that more diagnostics are needed to identify the problem.

