We have numerous flows (150-200) that are triggered by all kinds of events. Recently I saw some weird behavior: certain flows scheduled with the standard Daily@Sys calendar (at various times of day) simply did not start. There is no record in Flow Manager showing that they even tried (i.e. no history). They had been running fine for a long time, skipped one day, and then ran fine again the next day.
Does anyone know how to troubleshoot/investigate? I'm guessing there are relevant logs somewhere within the PM product install directory, but I'm not as familiar with this product as I am with SAS products.
My first hunch is a load or server limit somewhere, but I doubt it: the latest instance involved a job scheduled for 12:05 am, and we do not have many jobs scheduled at that time.
In case it's relevant, we are on SAS 9.4M4, Platform Suite 9.1.3, Linux x64 servers in a grid environment.
Thanks!
SAS Tech Support recommended I take various steps to clear out the JFD/Process Manager history and cache. I followed their recommendation and so far no issues. Here's what they told me:
1. Back up and delete lsb.events.*; keep only lsb.events and lsb.events.1.
2. Back up and delete the following two files:
(a) $JS_HOME/work/system/jobidmap.dat.1
(b) $JS_HOME/work/system/lsf.events
3. Back up and delete all files in $JS_HOME/work/history except those created in the last 3-5 days.
4. Back up and delete all files in $JS_HOME/work/events except those created in the last 3-5 days; always keep js.events.
5. Back up and delete all files in $JS_HOME/work/variable except those created in the last 3-5 days.
6. Back up and delete all files in $JS_HOME/work/storage/error except those created in the last 3-5 days.
7. Back up and delete all files in $JS_HOME/work/storage/flow_instance_storage/finished, keeping the last 5 days.
8. Back up and delete $LSF_HOME/work/<cluster>/logdir/lsb.events.* except those created in the last 3-5 days (for example, with $LSF_HOME = D:\LSF_51 on Windows, the directory is D:\LSF_51\work\cluster1\logdir).
9. Back up and delete all files in $JS_HOME/work/storage/cache/.
10. Back up and delete the log files located in $JS_HOME/log.
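Most of these steps share one pattern: move files older than a few days out of a directory, sparing a few named files. Below is a minimal Python sketch of that pattern, not something SAS Tech Support provided. The function name `archive_old_files` and the backup path are hypothetical; file age is judged by modification time because Linux does not reliably expose creation time; subdirectories are skipped; and it defaults to a dry run. It is meant to be run during a maintenance window.

```python
import os
import shutil
import time
from pathlib import Path


def archive_old_files(directory, backup_dir, keep_days=5, keep_names=(),
                      dry_run=True, now=None):
    """Move plain files older than keep_days from directory into backup_dir.

    Files named in keep_names (e.g. "js.events") are never touched.
    Age uses st_mtime as a stand-in for creation time (an assumption).
    Returns the names of the selected files; with dry_run=True nothing moves.
    """
    directory = Path(directory)
    backup_dir = Path(backup_dir)
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # seconds per day
    selected = []
    for entry in sorted(directory.iterdir()):
        # Skip subdirectories and any file we were told to keep.
        if not entry.is_file() or entry.name in keep_names:
            continue
        if entry.stat().st_mtime < cutoff:
            selected.append(entry.name)
            if not dry_run:
                backup_dir.mkdir(parents=True, exist_ok=True)
                shutil.move(str(entry), str(backup_dir / entry.name))
    return selected


if __name__ == "__main__":
    js_home = os.environ.get("JS_HOME")
    if js_home:
        # Dry run of step 4: list what would be archived, keeping js.events.
        print(archive_old_files(os.path.join(js_home, "work", "events"),
                                "/tmp/pm_backup/events",  # hypothetical backup location
                                keep_days=5, keep_names=("js.events",)))
```

Review the dry-run list before passing `dry_run=False`. Keeping the backup as a move rather than a copy matches the "back up, then delete" wording of the steps.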
Hello @Timmy2383,
If you are on a grid, there may be limits on job execution: jobs can be held in a queue, but they will still be triggered and will execute at some point. There is a difference between the Process Manager (JS) and the LSF resource manager. Process Manager won't act much differently from any other scheduler, such as cron or at; the only difference is that you can customise calendars.
First, I would check the jfd file (you can find it with locate; it will be in your JS_Top/work directory).
Second, check what else was happening on the server when the job did not run. For example, the jfd service may have been stopped, which would explain why the job was not triggered, or there may have been maintenance on the server (security patches?).
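One way to look at what was happening around the missed start is to pull every log line stamped within a short window. Below is a minimal Python sketch; `lines_in_window` is a hypothetical helper, and the timestamp regex and `strptime` format are placeholders you would adapt to whatever format your jfd and LSF logs actually use. It only reads files; it does not know anything about Process Manager internals.

```python
import re
from datetime import datetime
from pathlib import Path


def lines_in_window(log_dir, start, end, ts_regex, ts_format, glob="*"):
    """Return (path, line_number, text) for log lines stamped in [start, end].

    ts_regex locates a timestamp in each line and ts_format parses it with
    datetime.strptime. Both are assumptions: match them to your real logs.
    Lines with no parsable timestamp are ignored.
    """
    stamp_re = re.compile(ts_regex)
    hits = []
    for path in sorted(Path(log_dir).rglob(glob)):
        if not path.is_file():
            continue
        # errors="replace" keeps binary or oddly encoded files from aborting the scan.
        with open(path, errors="replace") as fh:
            for number, line in enumerate(fh, start=1):
                match = stamp_re.search(line)
                if match is None:
                    continue
                try:
                    stamp = datetime.strptime(match.group(0), ts_format)
                except ValueError:
                    continue  # looked like a timestamp but did not parse
                if start <= stamp <= end:
                    hits.append((path, number, line.rstrip("\n")))
    return hits
```

For the 12:05 am case, a window from roughly 23:55 to 00:15 on the affected night is a reasonable start; an empty result for a log that normally writes at that time is itself worth noting.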
Thanks, Juan.
There definitely wasn't any maintenance going on. It's possible the JFD stopped for some reason but was restarted by EGO, not sure how I could determine that, though.
So far SAS TS has recommended clearing out many of the PM logs and the cache. I will have to do this during the next scheduled maintenance window and then monitor after that.
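For the monitoring part, a flow on the Daily@Sys calendar should show one start per day, so a skipped day can be detected by comparing observed start times against the calendar range. Below is a minimal Python sketch; `missing_run_days` is a hypothetical helper, and how you collect the start times (from Flow Manager, the history files, or your own wrapper logging) is an assumption left outside the sketch.

```python
from datetime import date, datetime, timedelta


def missing_run_days(start_times, first_day, last_day):
    """Return every date in [first_day, last_day] with no recorded start.

    start_times is any iterable of datetime objects, one per observed
    flow start. Gathering them is outside this sketch.
    """
    seen = {t.date() for t in start_times}
    missing = []
    day = first_day
    while day <= last_day:
        if day not in seen:
            missing.append(day)
        day += timedelta(days=1)
    return missing


if __name__ == "__main__":
    # Illustrative data only: a 12:05 am flow that skipped one night.
    starts = [datetime(2024, 3, 1, 0, 5), datetime(2024, 3, 3, 0, 5)]
    print(missing_run_days(starts, date(2024, 3, 1), date(2024, 3, 3)))
    # prints [datetime.date(2024, 3, 2)]
```

Running a check like this each morning for the handful of flows that skipped would show quickly whether the cleanup actually stopped the problem.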
Same old nice trick 🙂. Thanks for sharing, @Timmy2383!