There is a scheduled flow in our SAS environment that runs every 5 minutes (at 1, 6, 11, etc. minutes past the hour) between 7am and 7pm. The main program in this flow uses a file that users may have open. Because of this, the program checks that it can get an exclusive file lock before proceeding; if it can't, it loops and rechecks for up to 20 minutes. If it still cannot get the lock, the program exits gracefully, so the flow always appears "Done" in Flow Manager.

The problem I'm having is that whenever a file lock causes other executions of the job to be delayed, Flow Manager seems to give up executing the flow once the currently executing instance, and any instances that were held up behind the originally delayed instance, have finished.

The most recent occurrence had a user accidentally leave the file open for almost 45 minutes. Conveniently, the file was opened just before the 1 minute past instance started. The 1 minute past instance executed and then held in its file lock check loop for the full 20 minutes, eventually ending gracefully and holding up the 6, 11, 16 and 21 minute past instances. This resulted in the 6 minute past instance executing around 21 minutes past the hour and also looping for the full 20 minutes until about 42 minutes past, holding up all the instances in between, and ending gracefully. The 11 minute past instance then executed around 42 minutes past. The file was eventually freed up around 45 minutes past, letting the 11 minute past instance complete its run, followed by the 16, 21, 26, 31, 36, 41, 46 and 51 minute past instances running back to back (46 and 51 were delayed by the execution of the other instances already queued up). Once they had all run successfully, Flow Manager did not execute any more instances of this flow until it was rescheduled through Schedule Manager in SAS Management Console the following morning.
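To make the timeline above easier to follow, here is a minimal Python sketch that replays it. It is only a model of the queueing behaviour I described, not anything taken from Flow Manager or LSF. The function name `simulate`, the 1 minute run time for a successful instance, and the exact release time of minute 45 are assumptions chosen for illustration.

```python
def simulate(trigger_minutes, lock_released_at, max_wait=20, run_time=1):
    """Replay queued flow instances one at a time.

    Returns a list of (scheduled, started, finished, got_lock) tuples,
    all expressed in minutes past the hour.
    run_time is an assumed duration for an instance that gets the lock.
    """
    timeline = []
    free_at = 0  # minute at which the previous instance finished
    for scheduled in trigger_minutes:
        # Only one instance may run at a time, so wait for the previous one.
        started = max(scheduled, free_at)
        if started >= lock_released_at:
            # File is already free: run straight away.
            finished = started + run_time
            got_lock = True
        elif lock_released_at - started <= max_wait:
            # File is freed while the instance is still in its check loop.
            finished = lock_released_at + run_time
            got_lock = True
        else:
            # Check loop times out and the program exits gracefully.
            finished = started + max_wait
            got_lock = False
        timeline.append((scheduled, started, finished, got_lock))
        free_at = finished
    return timeline


# Triggers at 1, 6, ..., 56 minutes past; file assumed freed at minute 45.
timeline = simulate(range(1, 60, 5), lock_released_at=45)
for scheduled, started, finished, got_lock in timeline:
    outcome = "ran" if got_lock else "gave up"
    print(f"{scheduled:02d} past: started {started:02d}, finished {finished:02d}, {outcome}")
```

Under those assumptions the replay shows the 1 and 6 minute past instances timing out at minutes 21 and 41, the 11 minute past instance getting the lock at minute 45, and the queued instances through 51 then running back to back, which is roughly what I observed.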
I've been unable to find anything that would explain why LSF is giving up on executing this flow, even though the same time period has worked on previous days without issue. I've gone from targeted searches of the LSF log files to opening every log file I could find, looking for any scrap of information that could point me in the right direction. Any time I find a log file that references the execution of this flow, there are no errors, only references to the delays while waiting for previous instances to finish before the next can run ("Cannot run this flow until the following work item finishes"). When I reach the end point of the situation above in the log files, there aren't any errors, as the flow always ends gracefully, and then the flow is simply not executed again.

It is easy enough to reschedule the flow, but I'm stumped as to what is causing it to give up executing any future instances after a delay occurs.

The properties of the flow in question are:

- Flow completion criteria: All items complete successfully or any item fails
- Actions after the state of the flow is determined: Complete any work in progress and stop running the flow
- Source for the flow exit code: The sum of the exit codes for all work items
- Allow only one instance of the flow to run at a time
- Run only when all of the conditions occur (there is only the condition below):
  - Calendar: Daily@sys
  - Hours: 7 - 18
  - Minutes: 1,6,11,16,21,26,31,36,41,46,51,56
  - Duration: 1