senu188
Quartz | Level 8

Platform Process Manager is running without issue; the issue is primarily in Platform LSF. I don't know if your LSF grid control node is hosted on the same node as your Platform Process Manager server.

 

Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: lsb_puteventrec() failed, errmsg: System call failed: No such file or directory.

Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: fflush() failed, Invalid argument.

[trimmed]

Aug 12 18:15:20 2020 4440:21868 3 9.1.3 log_mbdDie: lsb_puteventrec() failed, errmsg: System call failed: Invalid argument.

Aug 12 18:15:20 2020 4440:21868 3 9.1.3 log_mbdDie: fflush() failed, Invalid argument.

 

What specifically happens is that the jobs are submitted to LSF, but are unable to execute due to errors related to lsb_puteventrec(). When the job does not return to PPM as started, PPM times out and kills all of the jobs with log events like this:

 

2020 Aug 13 01:25:04 1732 6596 3 JFLSFExecutionAgent::_submitToLSF: The job submission script has been running for too long, and is killed by JFD; error code '118'.

 

This pattern has occurred repeatedly in the last year. I'm not certain if this is specifically a network drop or a file system issue. All that we know from the LSF side is that mbatchd can't access this file when the issue occurs, and that the problem is resolved once it regains access.

 

 

 

My question: 

"I don't know if your LSF grid control node is hosted on the same node as your Platform Process Manager server." What does this mean, and how can I identify it? As far as I know, both are installed on the application server.

ACCEPTED SOLUTION
JuanS_OCS
Amethyst | Level 16

Hi @senu188 ,

 

I am not aware of your company's security policies, but if I were you, I would definitely delete those 2 latest posts, or mask the company-sensitive information ASAP, just in case. My 2 cents.

 

This being said:

 

- The info "IBM PPM version 9.1.3.0" was relevant. I would apply the latest patches: http://ftp.sas.com/techsup/download/hotfix/platformpatch.html

https://support.sas.com/kb/63/415.html

 

- I believe I found the issue description on the IBM support site: https://www.ibm.com/support/pages/jfd-killed-job-submission-script-flow

 

Cause

If the job submission script runs for more than 5 minutes (i.e. default value of JS_JOB_SUBMISSION_SCRIPT_TIME_OUT), JFD will kill the job submission script. JS_JOB_SUBMISSION_SCRIPT_TIME_OUT specifies the length of time for which the job submission script can run before the Process Manager daemon (JFD) kills the script.

Resolving The Problem

To increase the waiting time before JFD kills the submission script, you can configure JS_JOB_SUBMISSION_SCRIPT_TIME_OUT in js.conf and restart JFD. Please note that if you set JS_EXTERNAL_EXECUTION=true in js.conf, the following parameters for job submission will not work: JS_JOB_SUBMISSION_SCRIPT_TIME_OUT, JS_JOB_SUBMISSION_TIMEOUT, JS_JOB_SUBMISSION_RETRY, JS_BSUB_RETRY_EXIT_VALUES.
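Before changing anything, it may help to see which of these parameters are already set. Below is a minimal Python sketch of such a check. It assumes js.conf uses plain NAME=value lines with # comments, which is an assumption rather than something confirmed in this thread. Only the parameter names come from the IBM note; the helper names, the sample text, and the value 600 are hypothetical.

```python
# Parameter names are taken from the IBM note; everything else is illustrative.
SUBMISSION_PARAMS = (
    "JS_JOB_SUBMISSION_SCRIPT_TIME_OUT",
    "JS_JOB_SUBMISSION_TIMEOUT",
    "JS_JOB_SUBMISSION_RETRY",
    "JS_BSUB_RETRY_EXIT_VALUES",
)


def parse_conf(text):
    """Parse NAME=value lines, skipping blanks and # comments (assumed format)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def review_submission_settings(settings):
    """Return one human-readable note per submission-related parameter."""
    external = settings.get("JS_EXTERNAL_EXECUTION", "").lower() == "true"
    notes = []
    for name in SUBMISSION_PARAMS:
        value = settings.get(name)
        note = f"{name}={value}" if value is not None else f"{name} is not set"
        if external:
            # Per the IBM note, these parameters do not work in this mode.
            note += " (has no effect while JS_EXTERNAL_EXECUTION=true)"
        notes.append(note)
    return notes


if __name__ == "__main__":
    # Made-up js.conf fragment; 600 is an arbitrary placeholder value.
    sample = """
    # sample js.conf fragment (hypothetical)
    JS_EXTERNAL_EXECUTION=true
    JS_JOB_SUBMISSION_SCRIPT_TIME_OUT=600
    """
    for note in review_submission_settings(parse_conf(sample)):
        print(note)
```

To use it on a real system, read the actual js.conf into a string and pass it to parse_conf. Check the IBM documentation for the unit of the timeout value before editing, and remember that JFD must be restarted afterwards.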

 

 


13 REPLIES
Anand_V
Ammonite | Level 13
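Run the commands 'jid' and 'lsid' and compare the hosts they report: jid shows the Process Manager server host, and lsid shows the LSF cluster name and master host.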
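If it helps, here is a small Python sketch that runs both commands and prints their output one after the other, so the reported host names can be compared by eye. It assumes the LSF and Process Manager environments are already set up so that jid and lsid are on the PATH. run_tool is a hypothetical helper name, and the sketch does not assume any particular output format.

```python
import shutil
import subprocess


def run_tool(name, timeout=30):
    """Run a command with no arguments; return its output or a note on failure."""
    path = shutil.which(name)
    if path is None:
        return f"'{name}' was not found on PATH"
    try:
        result = subprocess.run(
            [path], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"'{name}' could not be run: {exc}"
    # Some tools write to stderr, so show both streams.
    return (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    for tool in ("jid", "lsid"):
        print(f"--- {tool} ---")
        print(run_tool(tool))
```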
senu188
Quartz | Level 8
Hi,
Thanks for the reply.
"we know from the LSF side is that mbatchd can't access this file when the issue occurs"

Can I know why it occurs and what the solution is?
Anand_V
Ammonite | Level 13
It seems from the original post that the mbatchd process is intermittently unable to access a file it needs (the lsb_puteventrec() errors suggest the event log rather than a config file). As mentioned in the diagnosis you shared, it could be due to a network or file-system issue. Have you reached out to the network and OS support teams to see if there are any system logs reporting this error as well?
senu188
Quartz | Level 8
Hi,
What should I check with the network team? I see that the lsb.events file is in the shared location J:\SAS_App\SAS94m3\LSFShare\work\cluster1\logdir. We installed Platform on the server (47). When the error occurs, what should I check, and how?
JuanS_OCS
Amethyst | Level 16

Hello @senu188 ,

 

I will assume for now that you do not have a firewall rule in place blocking the LSF/JS/EGO port numbers (please check; if you do, you can try temporarily disabling the firewall, if allowed, just to see whether the issue keeps happening).

 

Would it be possible for you to share the type and version of the distributed/shared file system you use, and the parameters used for the mount? (I am specifically considering if flock is in place)

 

It would also be interesting to learn whether the JS_SHARED and LSF_SHARED directories are on this same share and whether all mounts are the same on every node. Does the lsfadmin user have the same uid across all machines?

 

In any case, I think you need to gather further information using a more methodical approach. You can choose one or several of the following:

 

- Check and regularly monitor system logs (messages, security, etc) on every node.

- Enable extended logging for JS and LSF, increasing them temporarily to DEBUG or TRACE.

- Check lsof and lslocks for the unreadable file on every machine AND the underlying shared file system servers.
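To complement the regular monitoring suggested above, a small probe can record exactly when the shared directory stops accepting writes. The Python sketch below is only an illustration: it writes, flushes, and deletes a tiny file in a directory you choose and prints a timestamped line whenever that fails. The function names and the example path are hypothetical, and the probe should point at a scratch folder on the share, not at the LSF work directory itself.

```python
import os
import time
import uuid
from datetime import datetime


def probe_once(directory):
    """Create, flush, and delete a small file in `directory`.

    Returns (True, "") on success, or (False, error text) on failure.
    """
    path = os.path.join(directory, f"probe_{uuid.uuid4().hex}.tmp")
    try:
        with open(path, "w") as handle:
            handle.write("probe\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the write through to the share
        os.remove(path)
        return True, ""
    except OSError as exc:
        return False, str(exc)


def monitor(directory, interval_seconds=30, rounds=None):
    """Probe repeatedly; print a timestamped line for every failure.

    With rounds=None the loop runs until it is interrupted.
    """
    count = 0
    while rounds is None or count < rounds:
        ok, error = probe_once(directory)
        if not ok:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} probe failed: {error}")
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval_seconds)


if __name__ == "__main__":
    # Hypothetical path: replace with a scratch folder on the shared file system.
    # Short demonstration run; use rounds=None for continuous monitoring.
    monitor(r"\\fileserver\share\probe_dir", interval_seconds=1, rounds=3)
```

Timestamps from a probe like this can then be lined up against the LSF log errors and the system logs on each node.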

 

 

senu188
Quartz | Level 8
Hi,

I am very new to SAS administration. Can you tell me how we can check? Can we use the %JS_ENVDIR% and %LSF_ENVDIR% variables?
JuanS_OCS
Amethyst | Level 16

Hello @senu188 ,

 

I understand! Well, LSF is not SAS proprietary software; it comes from IBM. Nonetheless, a few quick pointers:

Here are a couple of nice cheat sheets from IBM. You can also navigate from them to learn more:

https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_quick_reference/lsf_quick_ref.html

https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_unix_install/lsf_installnewunix_dirstr...

 

This being said, you can locate the js.conf and lsf.conf files in this shared file system. They will be in the /conf folders of each product (LSF and JS). Inside the files, you can see every variable, including folder paths.

 

Based on the above, one of them should be in J:\SAS_App\SAS94m3\LSFShare\conf, and LSF_SHARE is J:\SAS_App\SAS94m3\LSFShare

 

When you have a moment, please look into my previous request:

"Would it be possible for you to share the type and version of the distributed/shared file system you use, and the parameters used for the mount? (I am specifically considering if flock is in place)"

doug_sas
SAS Employee

When any action occurs for a job (such as it being submitted, scheduled to run, started, completed, etc.), an 'event' record is recorded in the lsb.events file. This file is used for master failover so the new master knows what is going on in the grid.

 

The errors appear to indicate that there are times when LSF cannot write an event record. If this happens when a job is submitted, I am not sure the job would ever run; if it happens when the job completes, I am not sure LSF would ever indicate that the job completed (resulting in the PPM timeout).

 

Unfortunately you would need to see if your IT people do something during those times that makes the shared directory unavailable or the network to that shared directory unavailable. For example, if the share is on a Windows machine and the IT department applies an update requiring the machine to be rebooted, the shared directory would become unavailable. If things like that are going on, you will need to make sure jobs are not running during the time the network share is unavailable.

senu188
Quartz | Level 8
Hi,
Thanks. I will try pinging the IP address of the file share from the server. Can I detect the network outage that way?
doug_sas
SAS Employee

It appears that you only know about the problem after it happens so pinging the server may not tell you anything.

 

The next time you see errors in the LSF logs about not being able to write out events, contact your IT department and see if there was anything that could have caused the shared filesystem to not be available at that specific time. You will need to work with them to check network logs, Windows events, etc. to find the cause of the problem.
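To give the IT department exact times to look at, the failure timestamps can be pulled out of the LSF log automatically. The Python sketch below is one possible way to do that, assuming the log lines look like the ones quoted at the top of this thread and that the month names are in English. The function names and the 15-minute grouping threshold are arbitrary choices made for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# "Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: fflush() failed, ..."
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) .*"
    r"(?:lsb_puteventrec|fflush)\(\) failed"
)


def failure_times(lines):
    """Return the sorted timestamps of event-write failures found in `lines`."""
    times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = " ".join(match.group("stamp").split())  # collapse spaces
            times.append(datetime.strptime(stamp, "%b %d %H:%M:%S %Y"))
    return sorted(times)


def outage_windows(times, max_gap=timedelta(minutes=15)):
    """Group failures separated by at most `max_gap` into (start, end) windows."""
    windows = []
    for moment in times:
        if windows and moment - windows[-1][1] <= max_gap:
            windows[-1][1] = moment  # extend the current window
        else:
            windows.append([moment, moment])
    return [(start, end) for start, end in windows]


if __name__ == "__main__":
    # Lines copied from the log excerpt in the original post.
    sample = [
        "Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: "
        "lsb_puteventrec() failed, errmsg: System call failed: "
        "No such file or directory.",
        "Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: "
        "fflush() failed, Invalid argument.",
        "Aug 12 18:15:20 2020 4440:21868 3 9.1.3 log_mbdDie: "
        "lsb_puteventrec() failed, errmsg: System call failed: "
        "Invalid argument.",
    ]
    for start, end in outage_windows(failure_times(sample)):
        print(f"event writes failing from {start} to {end}")
```

For a real log, read the file's lines and pass them to failure_times. The resulting windows are only a starting point for comparing against network logs and Windows events.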

 

JuanS_OCS
Amethyst | Level 16

I tend to agree with @doug_sas. He is a guru in terms of SAS Grid Manager.

 

I still wonder what shared file system you use (e.g., NFS is generally not recommended and can potentially cause this kind of issue), but I leave it to you, @senu188.

 

Perhaps you can try the info I shared earlier, and when the issue appears again, I would indeed follow @doug_sas's wise advice and align with your sysadmin peers.

senu188
Quartz | Level 8
Hi,

Ok sure.
