Hi @PaigeMiller
Thanks for the answer. I know that code can be deployed from SAS Studio, and that there are possibilities for scheduling jobs and defining time-events and file-events as triggers, though I haven't done any experiments yet. But there are still a lot of unanswered questions. I apologize for bothering the community with a full novel, but I have done my best to focus on some major problems and omit a lot of minor details and problem areas.
Organizing objects
In our DI Studio Folder Tree, objects are organized in data areas. Each data area is a folder with subfolders for data produced in the area (tables and libraries), jobs producing these tables (with deployed jobs and the corresponding flow) and external files read/written by the jobs. The data areas are organized in hierarchies with permissions set at top level, and physical storage is organized in the same structure with inherited permissions.
This structure contains 2,629 folders in 785 data areas, and they hold about 21,000 objects (jobs, deployed jobs, libraries, tables and external files – today's count). We don't know how to maintain a similarly manageable structure in SAS Studio, where each user has a limited view and access to physical data based on permissions granted through about 100 different AD groups. We don't even know if it is possible to control logical access in SAS Studio based on AD groups as we do today.
Jobs and flows
For maintenance reasons, we try to keep jobs small, meaning no more than about 50 DI Studio Transformations in a single job, preferably fewer, and only one or a few tables as output. Then we build flows in SAS Management Console, where we use a flow as a "running unit": a group of interconnected jobs with internal dependencies. A flow corresponds to one data area in the DI Studio Folder Tree, so the Folder Tree is the skeleton on which everything hangs.
We use LSF as a convenient way to get flows executed in a SAS Grid server cluster with load balancing, but we don't use Process Manager/LSF as a scheduler. The reason is that Process Manager cannot handle triggering based on previous flow events, only time events and file events. Of the 785 flows counted today, 114 are "initial" flows, meaning they are not dependent on results from previous flows and can be started based on time events alone. The remaining 671 flows depend on results from previous flows (usually several) in a complicated hierarchy, where a chain can be more than 15 flows long and a given chain can contain scores of upstream flows.
This cannot be maintained manually, especially with a change rate of 220 new/changed jobs and 35 new/changed flows as a weekly average. We have built a scheduler that initiates each daily batch by building a virtual "super-flow" out of all flows that have the current day set as their running date, and then proceeds by releasing or excluding flows based on previous results. It is fully automatic, so no manual charting of dependencies between flows or defining of triggers is involved. The mechanism is based on table lineage extracted from SAS Metadata at batch start time.
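To make the mechanism concrete, here is a minimal sketch in Python of the release/exclude logic. It assumes the lineage has already been extracted into two mappings (inputs_of: which tables a flow reads, producer_of: which flow writes a table). The names and the Python rendering are purely illustrative, not our actual implementation:

from collections import defaultdict

def build_super_flow(flows_today, inputs_of, producer_of):
    # Derive flow-to-flow dependencies for today's batch from table lineage:
    # flow B depends on flow A if B reads a table that A writes.
    predecessors = defaultdict(set)
    for flow in flows_today:
        predecessors[flow]  # ensure every flow has an entry
        for table in inputs_of.get(flow, ()):
            producer = producer_of.get(table)
            if producer and producer != flow and producer in flows_today:
                predecessors[flow].add(producer)
    return predecessors

def run_batch(flows_today, predecessors, run_flow):
    # Release a flow when all its predecessors succeeded; exclude it as
    # soon as any predecessor failed or was itself excluded.
    status = {}                      # flow -> "ok" | "failed" | "excluded"
    pending = set(flows_today)
    while pending:
        progressed = False
        for flow in sorted(pending):
            preds = predecessors[flow]
            if any(status.get(p) in ("failed", "excluded") for p in preds):
                status[flow] = "excluded"
            elif all(p in status for p in preds):
                status[flow] = "ok" if run_flow(flow) else "failed"
            else:
                continue             # predecessors still running/waiting
            pending.discard(flow)
            progressed = True
        if not progressed:           # only a dependency cycle can cause this
            break
    return status

# Toy example: load_sales writes table SALES, which agg_sales reads.
inputs_of = {"load_sales": ["SRC_SALES"], "agg_sales": ["SALES"]}
producer_of = {"SALES": "load_sales"}
flows = {"load_sales", "agg_sales"}
preds = build_super_flow(flows, inputs_of, producer_of)
print(run_batch(flows, preds, run_flow=lambda f: True))
# -> {'load_sales': 'ok', 'agg_sales': 'ok'}

The point of the sketch is that the dependency graph is never drawn by hand: it falls out of the lineage at batch start time, which is why the weekly change rate costs us nothing in scheduling maintenance.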
The structure has been allowed to grow into such complexity over the years because there is no manual work involved, so it has been going on without anybody realizing that it might be difficult to maintain in other environments. We spend about 20 hours per week promoting jobs to production, scheduling jobs and defining flows, monitoring batch execution, identifying and correcting errors, rerunning failed flows etc. Thanks to automation of all processes from promotion to monitoring, this has been constant over 12 years while our Data Warehouse has grown by a factor of 10 or more.
And what now
This boils down to five technologies that we use today to run our Data Warehouse, and it seems that SAS Viya does not offer similar functionality for any of them, except (maybe) no. 2. I have underlined what I consider to be the primary outcome of each technology when it comes to building a similar batch environment in SAS Viya.
1. Deployment in DI Studio with automated generation of command strings to execute jobs.
2. Building Process Manager Flow Definitions in SAS MC with internal job dependencies.
3. Command-line execution of Flow Definitions by Process Manager (a hedged sketch of a possible Viya counterpart follows this list).
4. Automated load balancing with LSF/Grid Manager in a server cluster with a shared file system.
5. Automatic charting of flow dependencies through lineage maintained in SAS Metadata.
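For no. 3, the nearest documented equivalent appears to be Viya's Job Execution REST service. Below is a minimal sketch of what script-driven execution might look like, assuming a job definition already exists and an OAuth access token has been obtained; host, token and definition id are placeholders, and the exact endpoints and state names should be verified against SAS's REST documentation for the Viya version in question:

import time
import requests

VIYA = "https://viya.example.com"                      # placeholder host
HEADERS = {"Authorization": "Bearer <access-token>"}   # placeholder token

def submit_job(definition_id):
    # Submit an execution request referencing an existing job definition.
    body = {"jobDefinitionUri": f"/jobDefinitions/definitions/{definition_id}"}
    r = requests.post(f"{VIYA}/jobExecution/jobs", json=body, headers=HEADERS)
    r.raise_for_status()
    return r.json()["id"]

def wait_for(job_id, poll_seconds=10):
    # Poll the job until it reaches a final state.
    while True:
        r = requests.get(f"{VIYA}/jobExecution/jobs/{job_id}", headers=HEADERS)
        r.raise_for_status()
        state = r.json()["state"]
        if state in ("completed", "failed", "canceled"):
            return state
        time.sleep(poll_seconds)

job_id = submit_job("<definition-id>")
print(wait_for(job_id))

Even if this works as advertised, it only covers single-job execution; the flow-level dependency handling (nos. 2, 3 and 5) would still have to be rebuilt on top of it.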
Our current figure of about 20 hours per week to maintain and run the batch environment, with many daily changes, makes us anxious about what the same work will cost in Viya. It will be difficult to get senior management to accept that migration to a new and hyped platform has a price. It will be hard to obtain sufficient resources for migration while keeping the existing setup running smoothly in parallel, and even harder to get them to realize that the new and smart platform might be a setback to the old mainframe days, requiring 10 employees in a separate operations team.