Ok, so this sounds slightly like something I've been working on.
First off, some principles:
- LSF runs each job completely separately, honouring the design of the flow in terms of sequence and dependencies
- each job run by LSF receives a unique "LSF Job Id"
- each instance of a flow receives a unique "LSF Flow Id" which persists across all job invocations as part of that instance of the flow
So the jobs within an instance of a flow are co-ordinated, but inherently share no context between them, apart from the name of the flow they're in and the LSF Flow Id - there's no intention or mechanism to share anything else between jobs.
So the suggestion of establishing some sort of shared data that's written by the first job and then read by the following jobs is a clear way to establish "context" across the jobs in the flow, but as has been pointed out, this doesn't work if there are multiple concurrent instances of a flow. Unless you get cunning.
So, LSF does tackle this challenge, but you have to dig a bit - the "secret" is the LSF Flow Id. All jobs in an instance of a flow share the same Flow Id, and it's an inherent attribute, unique to each instance of a flow. So it is quite possible to construct a framework that allows multiple instances of any flow to run simultaneously, each with a shared context across all the jobs in that instance, discrete from every other instance.
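For illustration, here's the shape of the idea - the paths and names are hypothetical, and the &lsf_flowid macro variable would be populated by the kind of start-of-job macro described below:

    /* Sketch only: every job in a flow instance derives the same
       instance-specific workspace from the Flow Id, so concurrent
       instances (different Flow Ids) never collide */
    %let work_root = /shared/etl/context;   /* hypothetical shared area */

    /* First job in the flow creates the directory (needs XCMD);
       subsequent jobs just attach to it */
    %sysexec mkdir -p &work_root./flow_&lsf_flowid.;

    libname flowctx "&work_root./flow_&lsf_flowid.";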
Now, it's generally the case that you only want one instance of a flow to run at a time - partly because that's how things normally go with classic batch ETL, and partly to avoid table contention, race conditions etc. However, there are definitely use cases where multiple instances make sense, as long as you can cope with the challenges mentioned.
I've recently been working on this, in pursuit of a concept I've loosely termed "Flow Swarming" - an approach to spawning multiple instances of a flow to parallelise stages of an ingest pipeline to maximise throughput, or to take adaptive approaches to workload management, like time-slot based throttling or reacting to the current level of batch processing.
The core approach to the "shared context" challenge has been to create a macro, run at the beginning of every job, that populates SAS macro variables for the LSF Job Id, Flow Id, Flow Name, OS PID, execution host, user etc. This allows jobs to create unique datasets, directories and files for each discrete instance of the flow, keyed on the Flow Id via these macro variables, which all jobs can leverage.
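As a sketch of the kind of macro I mean (the macro and variable names here are my own invention, and some details vary by platform):

    %macro lsf_context;
      %global lsf_jobid lsf_flowid lsf_flowname exec_host exec_user os_pid;

      /* LSF exports LSB_JOBID into each job's environment */
      %let lsf_jobid = %sysfunc(sysget(LSB_JOBID));

      /* The rest come from SAS automatic macro variables */
      %let exec_host = &syshostname.;  /* execution host */
      %let exec_user = &sysuserid.;    /* OS user running the job */
      %let os_pid    = &sysjobid.;     /* OS process id of the SAS session */

      /* Flow Id and Flow Name aren't in the environment - see the
         bhist parsing sketch below */
    %mend lsf_context;

Call %lsf_context at the top of every job (or via an autoexec/initstmt) and everything downstream can build instance-unique names from the resulting macro variables.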
The brief details: the LSF Job Id is populated as an OS environment variable by default, which you can read with sysget('LSB_JOBID').
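For example, either of:

    job_id = sysget('LSB_JOBID');                 /* in a data step     */
    %let lsf_jobid = %sysfunc(sysget(LSB_JOBID)); /* in open macro code */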
The LSF Flow Id isn't present as an environment variable, but you can access it via the LSF bhist command, supplying the Job Id as a parameter and piping the output into a data step to parse out the Flow Id. (Caveat: this requires access to "XCMDs", which can be contentious.)
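Roughly along these lines - the exact bhist output layout varies by LSF/Process Manager version, so treat the parsing below as illustrative and check it against your own bhist output:

    /* Run bhist for this job and scan the long-format output for
       the Flow Id (this is the bit that needs XCMD access) */
    filename bh pipe "bhist -l &lsf_jobid.";

    data _null_;
      infile bh truncover;
      input line $char200.;
      /* Illustrative parse: assumes the Flow Id follows the text
         "Flow Id" somewhere in the output - verify locally */
      pos = index(line, 'Flow Id');
      if pos > 0 then do;
        flow_id = scan(substr(line, pos + 7), 1, ' :<>');
        call symputx('lsf_flowid', strip(flow_id), 'G');
      end;
    run;

    filename bh clear;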
Overall, this elaborates nicely into a framework for both detailed job execution logging and parallel execution of flows, with a shared context across all jobs in each flow instance, unique to that instance.
You can layer a state-based flow triggering mechanism on top, to control flow triggering and the "flow swarming" dynamics.
Some of the basics of the state-based triggering are in the presentation I posted here earlier in the year on "Advanced ETL Scheduling" - swarming is an elaboration on this, which requires the distinct shared context mechanism I've touched on.
I've got a working prototype, and I'm hoping to be able to present it soon.