strsljen
Obsidian | Level 7

Hi,

 

I am trying to set up an LSF flow with several deployed SAS DI jobs. Some of them I can run in parallel, some of them depend on other jobs in the flow.

The problem is: LSF runs each and every job in the flow as a separate session.

 

As a result, macro variables generated in the first job are not visible to the other jobs, and sharing them is a mandatory condition for the flow to run.

(I have no such issues with it in DI Studio.)

 

Since LSF is my only option to schedule the flow with parallelism (where applicable), this behavior puts us back into sequential mode, increasing the estimated running time from 1h30min to 4h.

 

Does anyone know if it's possible to convince LSF to run jobs within the same flow in a single SAS session?

 

Thanks!

 

Best regards.

--
Mario
8 REPLIES
LaurieF
Barite | Level 11

LSF sessions are jobs that all run in their own little silos, without any knowledge of what's going on around them. LSF handles their success (or failure) and allows (or disallows) the jobs scheduled after them to proceed. So what you want to do, in the way you want to do it, lovely though it would be, isn't possible.

 

But here's what you should be able to do:

  • Isolate the part of your first job which creates the macro variables - hopefully it's all at the beginning
  • Copy the contents of these variables to a semi-permanent SAS dataset that all your jobs can see
    • Hint: have a look at sashelp.vmacro
  • Split that job in two, and make the second half dependent on the first
  • In each downstream job, reverse the process and read the macro variables back from the dataset (see the sketch after this list)
  • Depending on your environment, LSF will then parallelise as many jobs after that as it can
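
Here's a minimal sketch of the write/read-back idea, assuming a permanent library called SHARED and a convention of ETL_-prefixed global macro variables - both are illustrative choices, not anything prescribed:

libname shared '/path/to/shared/area';       /* any permanent library all jobs can reach */

/* First (split-off) job: persist the global macro variables it created */
%let etl_seq_id = 12345;                     /* e.g. the value fetched from Oracle */

data shared.flow_macros;
   set sashelp.vmacro;                       /* live view of current macro variables */
   where scope = 'GLOBAL' and substr(name, 1, 4) = 'ETL_';
   keep name value;
run;

/* Each downstream job: read the values back into global macro variables */
data _null_;
   set shared.flow_macros;
   call symputx(name, value, 'G');
run;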

 

strsljen
Obsidian | Level 7

Hi  LaurieF,

 

Thanks for your answer. I will look into it.

 

Just one thought, though: how can I ensure, with that approach, that if another instance of the flow happens to be executed, it doesn't interfere with (or overwrite) the same variables coming from my original macros?

 

Originally, I get a sequence ID from an Oracle sequence at the beginning and hold it throughout the flow; all jobs use the value of that macro variable. If each job has to read the value from a shared data set, there is always a chance that it gets messed with.

 

Best regards,

 

Mario.

--
Mario
LaurieF
Barite | Level 11

OK - now you're making things a bit more complicated. If you're thinking of sharing information across jobs, they won't be independent of each other. But remember that the contents of the macro variables you read from the dataset remain within each discrete job, unknown to the others, so you will be safe there.

strsljen
Obsidian | Level 7

Hi,

 

These jobs are not independent, although some of them can be run in parallel. They all deliver a set of tables that make up a single data mart.

So, at the end of the day, we need them as one flow, sharing the same macro variables (one of which is the sequence key).

 

Thanks for the clarification. I will formally ask our SAS support to confirm this behavior of LSF and sessions. If that is the "as designed" solution, we will have to handle it in a different way.

--
Mario
LaurieF
Barite | Level 11

As I said to @SASKiwi, it is possible to parallelise a DI job itself (I can't remember where the setting is), but the code it generates is hugely complex and I don't pretend to understand how it works. I wouldn't recommend it, mainly because in this situation it would break the tacit DI rule of having one job per output.

 

But it seems that the problem you are trying to solve is itself quite complex, outside the bounds of DI and LSF, and there may be a more efficient way to break it down so the inter-dependencies are fewer.

SASKiwi
PROC Star

Another way to do this, if you have SAS/CONNECT, would be to parallelise inside your single SAS session by creating additional child sessions. This has to be done in SAS code, though, so I don't know how compatible that will be with DI jobs.
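
For what it's worth, a minimal MP CONNECT sketch of that idea (assuming SAS/CONNECT is licensed and the server allows spawning child sessions via SASCMD; the task names are made up):

options autosignon sascmd='!sascmd';         /* spawn child sessions on the same host */

rsubmit task1 wait=no;
   /* ... code for one independent branch ... */
endrsubmit;

rsubmit task2 wait=no;
   /* ... code for another independent branch ... */
endrsubmit;

waitfor _all_ task1 task2;                   /* block until both children finish */
signoff _all_;

Note that child sessions don't automatically see the parent's macro variables either - you'd pass those down with %syslput.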

LaurieF
Barite | Level 11

It is possible to parallelise within a DI job, but it creates hugely complex code which I've never been happy with. It's more trouble than it's worth. Much easier to compartmentalise a DI process into discrete paths and let LSF handle everything.

AngusLooney
SAS Employee

Ok, so this sounds slightly like something I've been working on.

 

First off, some principles:

- LSF runs each job completely separately, honouring the design of the flow in terms of sequence and dependencies

- each job run by LSF receives a unique "LSF Job Id"

- each instance of a flow receives a unique "LSF Flow Id" which persists across all job invocations as part of that instance of the flow

 

So the jobs within an instance of a flow are co-ordinated, but inherently don't share any context between them - apart from the name of the Flow they're in, and the LSF Flow Id. So there's no intention or mechanism to share anything between jobs, other than the LSF Flow Id.

 

So the suggestion of establishing some sort of shared data that's written by the first job, and then read by the following jobs is a clear way to establish "context" across the jobs in the flow, but as has been pointed out, this doesn't work if there are multiple concurrent instances of a flow. Unless you get cunning.

 

So, LSF does tackle this challenge, but you have to dig a bit - the "secret" is the LSF Flow Id. All jobs in an instance of a flow share the same Flow Id, and it's an inherent attribute, unique to each instance of a flow. So it is quite possible to construct a framework that allows multiple instances of any flow to run simultaneously and creates a shared context across all the jobs that are part of the flow, discrete for each instance.

 

Now, it's generally the case that you only want one instance of a flow to run at a time, as that's how things normally go with classic batch ETL, but also to avoid table contention, race conditions etc. However there are definitely use cases where multiple instances make sense, as long as you can cope with the challenges mentioned.

 

I've recently been working on this, in pursuit of a concept I've loosely termed "Flow Swarming" - an approach to spawning multiple instances of a flow to parallelise stages of an ingest pipeline to maximise throughput, or to take adaptive approaches to workload management like time-slot based throttling or reacting to the current level of batch processing.

 

The core approach to the "shared context" challenge has been to create a macro, run at the beginning of all jobs, that populates SAS macro variables for the LSF Job Id, Flow Id, Flow Name, OS PID, execution host, user etc. Using these macro variables, jobs can create datasets, directories and files that are unique to each discrete instance of the flow (keyed on the Flow Id) and which all jobs in that instance can leverage.

 

The brief details are: the "LSF Job Id" is populated as an OS environment variable (LSB_JOBID) by default, which you can read with sysget('LSB_JOBID').
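
For example, at the top of each job (sketch only):

%let lsf_job_id = %sysget(LSB_JOBID);        /* LSF sets LSB_JOBID for every job it runs */
%put NOTE: Running as LSF job &lsf_job_id.;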

 

The LSF Flow Id isn't present as an environment variable, but you can access it via the LSF bhist command, supplying the Job Id as a parameter and piping the output into a data step to parse out the Flow Id. (Caveat: this requires access to "XCMDs", which can be contentious.)
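
A rough sketch of that parsing step (requires XCMD). The exact bhist -l output format varies by LSF/Process Manager version, so the FLOW_ID= token below is only a placeholder for whatever your site's output actually contains:

filename bhist pipe "bhist -l &lsf_job_id.";

data _null_;
   infile bhist truncover;
   input line $char1000.;
   pos = find(line, 'FLOW_ID=');             /* placeholder token - adjust to your output */
   if pos then call symputx('lsf_flow_id', scan(substr(line, pos + 8), 1, ' ;,'), 'G');
run;

filename bhist clear;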

 

Overall, this nicely elaborates into a framework for both detailed job execution logging and parallel execution of flows, with a shared context across all jobs in each flow instance, unique to that instance.
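
To illustrate the "unique datasets and directories" part: once &lsf_flow_id is populated, each job can point at an instance-specific area (the path and naming convention below are just placeholders):

%let inst_dir = /data/etl/instances/flow_&lsf_flow_id.;

options dlcreatedir;                         /* create the directory on LIBNAME if missing */
libname inst "&inst_dir.";

data inst.stage1_customers;                  /* unique per flow instance, shared by its jobs */
   set work.customers;
run;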

 

You can layer a state-based flow triggering mechanism on top, to control flow triggering and the "flow swarming" dynamics.

 

Some of the basics of the state-based triggering are in the presentation I posted here earlier in the year on "Advanced ETL Scheduling" - swarming is an elaboration on this, which requires the distinct shared context mechanism I've touched on.

 

I've got a working prototype, and I'm hoping to be able to present it soon.

