When any action occurs for a job (it is submitted, scheduled to run, started, completed, and so on), an 'event' record is written to the lsb.events file. This file is used for master failover, so that the new master knows what is going on in the grid.
The errors appear to indicate that there are times when LSF cannot write an event record. If this happens when a job is submitted, I am not sure the job would ever run; if it happens when the job completes, I am not sure LSF would ever indicate that the job completed (resulting in the PPM timeout).
Unfortunately, you will need to find out whether your IT people do something during those times that makes the shared directory, or the network path to it, unavailable. For example, if the share is on a Windows machine and the IT department applies an update that requires a reboot, the shared directory becomes unavailable until the machine is back up. If things like that are going on, you will need to make sure jobs are not running while the network share is unavailable.
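
One way to line up the errors with IT activity is to probe the shared directory on a schedule and record when writes fail. The following is only a minimal Python sketch, not an LSF tool: the names `share_is_writable` and `monitor` and the example path are my own assumptions, and the probe simply tries to create a small temporary file in the directory you give it.

```python
import os
import tempfile
import time
from datetime import datetime


def share_is_writable(directory):
    """Return True if a small temporary file can be created and synced in `directory`."""
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="lsf_probe_") as handle:
            handle.write(b"probe")
            handle.flush()
            os.fsync(handle.fileno())
        return True
    except OSError:
        # Covers a missing directory, permission problems, and I/O errors.
        return False


def monitor(directory, interval_seconds=60, checks=None, log=print):
    """Probe `directory` repeatedly and log a timestamp for every failed write.

    With checks=None the loop runs until interrupted; otherwise it stops
    after `checks` probes. Returns the number of failed probes.
    """
    failures = 0
    count = 0
    while checks is None or count < checks:
        if not share_is_writable(directory):
            failures += 1
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} cannot write to {directory}")
        count += 1
        if checks is None or count < checks:
            time.sleep(interval_seconds)
    return failures


if __name__ == "__main__":
    # Hypothetical path: replace it with the shared directory that holds lsb.events.
    monitor("/path/to/lsf/shared/dir", interval_seconds=60)
```

Run it from the grid hosts that report the errors and compare the logged timestamps with the times of the LSF errors and with any maintenance windows your IT people can tell you about. A hung network mount may block the probe rather than fail it, so a long gap between log lines can also be a sign of trouble.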