AndreasEM
Fluorite | Level 6

hi,

 

We're running flows in batch on LSF (version 9.1.1.1) in our environment. From time to time, flows stall between jobs for no apparent reason, or the first job never gets started. The flow status is "running" but no jobs are running in the flow: either some jobs have finished with exit code 0 and the next job(s) in line are still waiting, or the first job never gets started. A typical dependency that we use is "Exit code less than 2".

 

It's hard to get anything out of the logs, since the problem is that nothing happens and so nothing gets logged. Has anyone seen this?

Thanks! 

AndreasEM
Fluorite | Level 6

Yes, they stop indefinitely. Usually we notice when the next instance of the flow is "waiting" (due to the restriction of only allowing one instance of the flow to run at a given time).

 

I don't think we've hit any processing limit. Even so, wouldn't the job just queue until a resource is free?

Kurt_Bremser
Super User

@AndreasEM wrote:

 

 

I don't think we've hit any processing limit. Even so, wouldn't the job just queue until a resource is free?


Yes, that's what I wanted to point at. The indefinite stop means that something else is going on.

jklaverstijn
Rhodochrosite | Level 12

Typically I would start by looking at flow and job status using the LSF Flowmanager application that comes with your LSF distribution. It will show you the status and history of both flows and jobs and tell you why things are the way they are. E.g. a flow can be suspended because another instance is already running and the "only one at a time" box was checked.

 

Another reason could be exhaustion of job slots for your cluster or specific queue.

 

Also these command line commands are useful:

 

bhosts

lshosts

bjobs

bhist

jhist

 

They can also tell you if the submission hosts and job slots are available.
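
A rough sketch of how those commands could be combined into a quick snapshot when a flow looks stalled. The wrapper function run_if_present is a made-up helper, not part of LSF, and the choice of flags is an assumption about what is useful here. lsload is not in the list above; it is a standard LSF command that reports current load per host, including free swap (swp).

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): run a command only if it is on
# PATH, so the script degrades gracefully on a machine without LSF.
run_if_present() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "== $* =="
    "$@" || echo "$1 exited with code $?"
  else
    echo "skipped: $1 not on PATH"
  fi
}

run_if_present bhosts            # per-host status and job slot usage
run_if_present lshosts           # static host resources (CPUs, max memory, max swap)
run_if_present lsload            # current load per host, including free swap (swp)
run_if_present bjobs -u all -r   # running jobs for all users
run_if_present bjobs -u all -p   # pending jobs, with the reason each one is pending
```

For a single suspicious job, bjobs -l <jobid> and bhist -l <jobid> give the long-format status and history.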

 

Bottom line is to use the LSF / Process Manager tool to diagnose this. The SAS logs won't help since SAS isn't running in the first place.

 

Hope this helps,

- Jan.

AndreasEM
Fluorite | Level 6

Yes, we use "only run one instance" on our flows, so we typically find out that a flow has stalled when the next instance sits in "waiting".

 

Flow manager doesn't tell us much here. The flow looks perfectly fine, with flow status "running", except that nothing is running.

 

We also have the Platform RTM web interface, but I haven't used that a lot. 

 

I'll look into the command line commands. I've used some of them, but not all. Thanks!

jklaverstijn
Rhodochrosite | Level 12

On a side note: do you use "Exit code less than 2" because you do not want warnings to stop your flow? Consider uncommenting the following lines in sasbatch.sh:

 

rc=$JOB_RC
if [ $rc -eq 1 ]; then
  rc=0
fi
exit $rc

It allows you to work with "job ended successfully" dependencies and will not make your job show up in red in Flowmanager.
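
To illustrate the effect of that mapping, here is a small standalone sketch. The function name map_rc is invented for the example and is not part of sasbatch.sh. It assumes the usual SAS convention on Unix, where exit code 0 means success, 1 means the job completed with warnings, and 2 or higher means an error.

```shell
#!/bin/sh
# Hypothetical helper mirroring the sasbatch.sh lines quoted above:
# a warning (1) is reported as success (0); everything else passes through.
map_rc() {
  rc=$1
  if [ "$rc" -eq 1 ]; then
    rc=0   # treat "completed with warnings" as success
  fi
  echo "$rc"
}

# Show what the scheduler would see for a clean run, a warning, and an error.
for code in 0 1 2; do
  echo "SAS exit code $code -> reported as $(map_rc "$code")"
done
```

The trade-off is that a warning is no longer visible from the exit code alone, so it has to be picked up from the SAS log instead.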

 

This supposes Linux/Unix but I would assume something similar to be available in Windows.

 

Hope this helps,

-- Jan.

AndreasEM
Fluorite | Level 6

Thanks! Typically we want to be aware of the warnings, but not for them to stop the batch. It's really considered case by case, though.

AndreasEM
Fluorite | Level 6

Seems like the root cause for the stalling flows was high swap usage on our Linux grid nodes. Thanks for your input! 🙂

