Very interesting, though I've only skim-read it so far.
One thing I would 100% endorse is the need for "instrumentation" on jobs, to log their execution, starts and stops. This is critical really, however you do it.
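As a rough illustration of what I mean by instrumentation, here is a minimal Python sketch, assuming jobs are plain functions and the "log" is just an in-memory list. The names `instrumented`, `job_log` and `load_customers` are made up for the example; in practice the events would go to a table or logging service.

```python
import functools
import time

def instrumented(job_log):
    """Decorator factory: record a start and a stop event for each job run."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            job_log.append({"job": func.__name__, "event": "start", "at": time.time()})
            status = "failed"  # assumed until the job returns normally
            try:
                result = func(*args, **kwargs)
                status = "ok"
                return result
            finally:
                # The stop event is written even if the job raises.
                job_log.append({"job": func.__name__, "event": "stop",
                                "status": status, "at": time.time()})
        return wrapper
    return decorate

job_log = []  # hypothetical in-memory log; stands in for a real audit table

@instrumented(job_log)
def load_customers():
    # Placeholder job body; returns a row count for illustration.
    return 42

load_customers()
```

The point is only that the start/stop logging is wrapped around the job rather than written inside it, so every job gets it the same way.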
Definitely recognise the issues around naming conventions and structured approaches. To me, the need for a pervasive, overarching conceptual framework rapidly becomes critical.
In a way, we're dealing with a series of levels of consideration:
- the sequence and interdependency of jobs within a "flow"
- the sequence and interdependency of flows within an "estate"
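Both levels can be treated the same way: as a dependency graph that gets resolved into a run order. A small sketch using Python's standard `graphlib` module (3.9+); the job and flow names are invented for the example.

```python
from graphlib import TopologicalSorter

def run_order(dependencies):
    """Return one valid execution order for {item: set of items it depends on}."""
    return list(TopologicalSorter(dependencies).static_order())

# Level 1: jobs within a flow (hypothetical names).
flow = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

# Level 2: flows within an estate (hypothetical names).
estate = {"ingest_flow": set(), "reporting_flow": {"ingest_flow"}}

print(run_order(flow))
print(run_order(estate))
```

The same function serves both levels, which is really the argument for having one conceptual framework rather than ad hoc sequencing at each level.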
I've recently been looking into the idea of decomposing ingest processes into completely decoupled stages, where bundles of data transition through a series of states. Those transitions happen through the actions of jobs/flows: a bundle in one state is read as input and written as output in the next state, which then becomes the input to the downstream processes.
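To make the state idea concrete, here is a minimal sketch, assuming each bundle is just a record carrying a state label. The state names ("landed", "parsed", "loaded") and the `run_stage` helper are hypothetical, chosen only for the example.

```python
def run_stage(bundles, input_state, output_state, action):
    """Apply action to every bundle in input_state, then move it to output_state.

    Stages are decoupled: each one only knows which state it reads
    and which state it writes, not which stage came before or after.
    """
    moved = 0
    for bundle in bundles:
        if bundle["state"] == input_state:
            action(bundle)
            bundle["state"] = output_state
            moved += 1
    return moved

bundles = [
    {"id": "b1", "state": "landed"},
    {"id": "b2", "state": "landed"},
    {"id": "b3", "state": "parsed"},
]

# Two independent stages; either could run on its own schedule.
run_stage(bundles, "landed", "parsed", lambda b: None)  # placeholder action
run_stage(bundles, "parsed", "loaded", lambda b: None)  # placeholder action
```

A stage that finds nothing in its input state simply does nothing, which is what lets the stages be scheduled independently.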
That leads to the ideas of "data queues" and of viewing instances of flows almost like "worker threads", including multiple parallelised instances of the same flow, each acting on discretely allocated collections of the bundles of data.
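A toy version of that, using Python's standard `queue` and `threading` modules as a stand-in for real flow instances. The `run_parallel` function and the "flow-N" instance names are assumptions for illustration only.

```python
import queue
import threading

def run_parallel(bundle_ids, process, instances=3):
    """Run several instances of the same 'flow'; each claims bundles from a shared queue."""
    work = queue.Queue()
    for bundle_id in bundle_ids:
        work.put(bundle_id)

    results = {}
    lock = threading.Lock()

    def flow_instance(name):
        while True:
            try:
                bundle_id = work.get_nowait()  # claim one bundle; no two instances get the same one
            except queue.Empty:
                return  # nothing left to claim
            output = process(bundle_id)
            with lock:
                results[bundle_id] = (name, output)

    threads = [threading.Thread(target=flow_instance, args=(f"flow-{i}",))
               for i in range(instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

results = run_parallel(range(10), lambda bundle_id: bundle_id * 2)
```

Which instance ends up with which bundle is not deterministic, but each bundle is processed exactly once, which is the property that matters.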
It started from looking at the challenges of ingesting very large volumes of raw data files, particularly XML, where neither "by file" nor "all in one go" approaches are performant or sustainable.
It starts to morph into streaming territory: queues, prioritisation, "backpressure" and the like.
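For what it's worth, the simplest form of prioritisation plus backpressure I can think of is a bounded priority queue: when it is full, the producer is told to hold off. A sketch with Python's standard `queue.PriorityQueue`; the `offer` helper and the bundle names are hypothetical.

```python
import queue

def offer(work, priority, bundle_id):
    """Try to enqueue a bundle; return False if the queue is full (backpressure)."""
    try:
        work.put_nowait((priority, bundle_id))
        return True
    except queue.Full:
        # Upstream should slow down or retry later rather than pile up more work.
        return False

work = queue.PriorityQueue(maxsize=2)  # deliberately tiny bound for the example

accepted_a = offer(work, 5, "bundle-a")
accepted_b = offer(work, 1, "bundle-b")
accepted_c = offer(work, 3, "bundle-c")  # rejected: queue already holds two bundles

next_item = work.get_nowait()  # lowest priority number comes out first
```

A real setup would more likely block or retry than drop the bundle, but the bounded queue is the core of it.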