Platform Process Manager is running without issue; the issue is primarily in Platform LSF. I don't know if your LSF grid control node is hosted on the same node as your Platform Process Manager server.

Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: lsb_puteventrec() failed, errmsg: System call failed: No such file or directory.
Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: fflush() failed, Invalid argument.
[trimmed]
Aug 12 18:15:20 2020 4440:21868 3 9.1.3 log_mbdDie: lsb_puteventrec() failed, errmsg: System call failed: Invalid argument.
Aug 12 18:15:20 2020 4440:21868 3 9.1.3 log_mbdDie: fflush() failed, Invalid argument.

What specifically happens is that the jobs are submitted to LSF but are unable to execute due to errors related to lsb_puteventrec(). When the job does not return to PPM as started, PPM times out and kills all of the jobs with log events like this:

2020 Aug 13 01:25:04 1732 6596 3 JFLSFExecutionAgent::_submitToLSF: The job submission script has been running for too long, and is killed by JFD; error code '118'.

This pattern has occurred repeatedly in the last year. I'm not certain whether this is specifically a network drop or a file system issue. All that we know from the LSF side is that mbatchd can't access this file when the issue occurs, but when it regains access the problem is resolved.

My question: what does "I don't know if your LSF grid control node is hosted on the same node as your Platform Process Manager server" mean, and how can I identify this? As far as I know, both are installed on the application server.
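
In case it helps with correlating the outages, here is a minimal sketch of how the mbatchd log lines quoted above could be scanned for lsb_puteventrec() failures. It assumes every line follows the layout seen in those excerpts (timestamp, pid:tid, level, version, function name, failed call); the function name find_event_write_failures and the regular expression are my own illustration, not part of LSF, and the real log may contain lines in other layouts that this simply skips.

```python
import re
from datetime import datetime

# Layout assumed from the mbatchd lines quoted in this post:
# "<Mon DD HH:MM:SS YYYY> <pid:tid> <level> <version> <function>: <call>() failed, <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"\d+:\d+ \d+ [\d.]+ "
    r"(?P<func>\w+): (?P<call>\w+)\(\) failed, (?P<msg>.*)$"
)


def find_event_write_failures(lines):
    """Return (timestamp, function, message) for each lsb_puteventrec() failure.

    Hypothetical helper: lines that do not match the assumed layout are ignored.
    """
    failures = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("call") == "lsb_puteventrec":
            ts = datetime.strptime(m.group("ts"), "%b %d %H:%M:%S %Y")
            failures.append((ts, m.group("func"), m.group("msg")))
    return failures


# The four log lines from this post, used as sample input.
sample = [
    "Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: lsb_puteventrec() failed, "
    "errmsg: System call failed: No such file or directory.",
    "Aug 12 18:02:16 2020 4440:21868 3 9.1.3 log_jobclean: fflush() failed, Invalid argument.",
    "Aug 12 18:15:20 2020 4440:21868 3 9.1.3 log_mbdDie: lsb_puteventrec() failed, "
    "errmsg: System call failed: Invalid argument.",
    "Aug 12 18:15:20 2020 4440:21868 3 9.1.3 log_mbdDie: fflush() failed, Invalid argument.",
]

for ts, func, msg in find_event_write_failures(sample):
    print(ts.isoformat(), func, msg)
```

The resulting timestamps could then be compared against the times PPM reports killing the submission script, to check whether the two always line up.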