Hello, experts!
I have a problem running scheduled jobs from Management Console using LSF.
LSF executes a .sh script, which is a wrapper around the sas command. The required deployed job and the SAS system options are passed to the .sh script as parameters.
The sas command inside the .sh script looks as follows:
/sas/SAS94/SASFoundation/9.4/sas -xcmd -noterminal -nosyntaxcheck -autoexec ./autoexec.sas -sysin ./scripts/$SCRIPT.sas -nolog -noprint -altlog $LOG_FILE
When the command crashes, nothing is written to the altlog. When we redirect the error output stream (stderr) to a file, we get the following error:
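Since the altlog stays empty on a crash, one option is to have the wrapper keep stderr and the exit status itself. The following is a minimal sketch, not the poster's actual script: the function name run_logged and the record format are hypothetical, and it assumes a POSIX shell.

```shell
# Hypothetical helper: run a command, append its stderr to a file,
# then append one record with timestamp, host name and exit status.
# The exit status of the command is passed back to the caller (LSF).
run_logged() {
    err_file=$1
    shift
    "$@" 2>>"$err_file"
    rc=$?
    printf '%s host=%s rc=%s cmd=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" "$(uname -n)" "$rc" "$1" >>"$err_file"
    return "$rc"
}

# Example call (the .stderr file name is an assumption):
# run_logged "$LOG_FILE.stderr" /sas/SAS94/SASFoundation/9.4/sas -sysin ./scripts/$SCRIPT.sas -altlog $LOG_FILE
```

Recording the host name next to each exit status makes it easier to see later whether the failures cluster on one execution host.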
ERROR: Recursive Segmentation Violations - system exiting
This error appears every day with multiple jobs. It is not tied to any particular job: a job can run successfully today and fail with this error tomorrow.
Most of the jobs run successfully, but some do not. When a job fails at startup, restarting it manually helps.
There are no problems with RAM utilization on the server.
Could you please tell me what this problem may be related to?
We could not find the cause of this error, but we found a workaround: configuring LSF to restart jobs that exit with codes 139 or 102. The number of restarts is set with MAX_JOB_REQUEUE. In our case it is 3.
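The configuration itself is not shown in the post. As a sketch only, assuming the requeue was set at queue level in lsb.queues with the standard REQUEUE_EXIT_VALUES parameter, it could look like this (the queue name sas_batch is hypothetical):

```
# lsb.queues (fragment); queue name is an assumption
Begin Queue
QUEUE_NAME          = sas_batch
REQUEUE_EXIT_VALUES = 139 102
MAX_JOB_REQUEUE     = 3
End Queue
```

After editing lsb.queues, the change is normally activated with badmin reconfig. Exit code 139 is 128 + 11, which is how a shell reports a process killed by SIGSEGV.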
I would try to find the similarities in the SEGVs.
If the SEGVs occur on one specific machine, my guess would be that something is out of sync due to an incomplete or failed hotfix installation. Assuming all your machines are at the same version/maintenance level with the same hotfixes installed, compare the files in the <SASROOT>/SASFoundation/9.4 directory on a working machine to the ones on the failing machine to see if there are differences, specifically in shared library files *.so (UNIX/Linux) or *.dll (Windows).
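One way to do that comparison on Linux is to build a checksum manifest on each machine and diff the two. This is a rough sketch under assumptions: the function name lib_manifest is hypothetical, sha256sum (GNU coreutils) is assumed to be available, and only *.so files are covered.

```shell
# Hypothetical helper: print "checksum  ./relative/path" for every *.so
# file under the given directory, sorted by path so that manifests
# produced on different machines line up when diffed.
lib_manifest() {
    (cd "$1" && find . -type f -name '*.so' -exec sha256sum {} + | sort -k 2)
}

# Example (run on each host, then copy one manifest over and compare):
# lib_manifest /sas/SAS94/SASFoundation/9.4 > /tmp/$(uname -n).manifest
# diff /tmp/hostA.manifest /tmp/hostB.manifest
```

Any line that diff reports points at a library that is missing or has different content on one of the machines.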
Also, the SEGVs may occur in multiple SAS programs, but only when executing something similar (a specific PROC or data step function). If you can find out what it was executing when SEGV occurred, maybe that would help narrow it down.
I'd suggest opening a track with SAS Tech Support for this. If it is a machine-specific problem, that raises the possibility that it is hardware-related. Is it a physical or virtual server? Hardware issues are more likely with physical servers. It's possible that more diagnostics are needed to identify the problem.