Hi @AngusLooney
I agree, mining job code can provide very useful information. In most cases I prefer to use metadata to get job information, because we don't have deployed code in our development environment, where we use the data for quality control. We do, however, use the deployed code to find warnings about "columns left empty", because that information is rather tricky to get from metadata.
I have read your SAS Forum paper "Advanced ETL Scheduling Techniques", and I think the way we use metadata in our scheduling might be of interest to you. Please forgive my clumsy language - English is not my native tongue...
Background
We are a Danish municipal administration, and we have used SAS for more than 30 years: first on mainframe, later on Windows servers, and now on a Linux grid. In 2007 we moved our SAS data warehouse to DI Studio/LSF. It is constantly evolving, with new jobs added and existing jobs changed, often with changes in input tables, on a daily basis. At present we have about 4,000 jobs in 600 LSF flows using 6,500 permanent tables, 900 external files and 25,000 work tables in our daily batch.
The structure is complex, with chains up to 30 flows long, where jobs in one flow depend on data from jobs in other flows, and many tables are used as input to many jobs; in some cases up to 100 jobs in 40 flows use the same table. Many flows have special run dates, but most run the night before each workday.
We have ETL developers in 6 different departments at different locations. All developers work in a development environment, and the production environment is centrally maintained. Besides naming conventions, we have two rules for jobs: an LSF flow corresponds to one job folder in DI Studio, and a given flow is limited to updating one library only.
Developers request promotion by exporting the relevant items and sending an email with a link to the spk package, a screenshot from DI Studio with the included items marked, and notes about special run frequency etc. We then import the package and deploy the jobs. Maintaining LSF flows and scheduling is part of the promotion process too.
Problem
As the number of jobs and flows grew, we ran into problems. Building and maintaining LSF flows became very time-consuming, as we had to open all jobs to determine the internal dependencies between the jobs in a flow before it could be defined with correct job dependencies and control nodes.
It also became more and more difficult to control flow triggering, because each new or changed job might change the dependencies on previous flows, and we had to open all jobs to find out which flows wrote the input tables, so we could ensure that a flow didn't start before all upstream flows were finished. We used file-event triggering by adding a last job to each flow. This job wrote a semaphore file upon completion, and all dependent flows were set up to use these files as triggering events.
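To illustrate the idea (this is not our actual code, and the path and flow name are made up), the last job in each flow was essentially nothing more than the sketch below, and the downstream flows had an LSF file-event trigger pointing at the file it writes:

/* Hypothetical "semaphore" job added as the last job of a flow.        */
/* Downstream flows were triggered by an LSF file event on this file.   */
filename sem "/sasdata/semaphores/FLOW_SALES_done.trg";

data _null_;
   file sem;
   now = datetime();
   put "FLOW_SALES finished at " now datetime19.;
run;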
In addition, we had a lot of trouble enforcing naming conventions and other rules, solving deadlocks caused by crossing dependencies, etc. The semaphore file method didn't fulfill our needs either: we never figured out how to make it work with different run schedules, and we never found a way to exclude a flow from running if a previous flow had failed (so output tables were left undamaged), while still letting further flows run in cases where yesterday's data in one contributing table was acceptable.
In 2010 we ran out of resources. We were 5 people responsible for running the servers, maintaining the development and production environments, getting new data sources into the warehouse, promoting jobs, running the batch, maintaining our SAS portal, finding and solving job errors, and guiding 20 ETL developers and more than 100 EG users. So something had to be done.
Solution
We became aware that when we used DI Studio to extract all the information needed to determine job and flow dependencies, we were really just using it as a metadata browser, so a program could extract the same information. That led to:
1) A "Virtual Auditor", a flow running in daily batch with jobs checking violation of naming conventions, missing libname descriptions, userwritten code in metadata (we want all userwritten code as sas files), reading from/writing to "illegal" levels in the data warehouse structure, passwords in code, unmapped columns etc. The information is presented as a SAS report and also mailed to the responsible developers (LastUpdated ResponsibleParty).
2) A Visual Studio application to assist with flow definitions. The application has a left pane with the metadata tree structure, similar to DI Studio, and clicking a job folder analyses all deployed jobs in the folder and draws all jobs and dependencies in the right pane, similar to Schedule Manager. With our application on one screen and the Schedule Manager edit pane on another, it is very easy and quick to create or change a flow. The application calls a SAS program to get the folder tree and a parent-child structure for the jobs in the folder, and thanks to Proc Metadata it is fast: less than 5 seconds to produce the drawing.
3) A "Virtual Operator", which is home-made scheduler with the main purpose of avoiding all definitions of triggering events in LSF. All flows are just set to "Run manually in the scheduling server" in SAS MC. It consists of 3 parts:
a) A metadata extract that runs before the daily batch starts. The result is a database containing actual tables of jobs, flows and flow dependencies. It also contains other tables: one is an updated flow attribute table with run dates, start-before and start-after times; another is an updated table with flow dependencies and file dependencies. A new flow gets default values from a parameter table, but the values can be edited, so that (for example) a flow is allowed to start even if a previous flow is excluded from the run, or an input file must not only exist but must have been updated within the last 24 hours.
b) A SAS job that starts at 6 PM every day and builds tables with flows (status "waiting" if included in the actual batch, otherwise "not-on-runlist") and the actual jobs in these flows. All dependencies are loaded into macro variables as logic expressions, and all flows are loaded into macro variables with their actual status. This is followed by a looping process, where
- an actual status is computed for all flows based on the status of the jobs in the database,
- all flow macro variables are loaded with the actual status taken from the database, and
- all expressions are evaluated, and based on that each flow is either marked "excluded from run" or submitted to LSF, and the flow macro variables are updated accordingly.
This is repeated every 30 seconds until the loop stops at 4 PM the following day, but normally the list is exhausted by 6 AM, with all jobs having a final status of Done or, in some cases, Failed or Excluded.
c) A Visual Studio application to control things. It has a main pane with all flows, ordered and colour-coded by actual status, with information on start and end times, average run times etc., and a right-click gives a menu to see the actual status of the jobs in the flow with run times and mean times, browse log files, show runtime graphs, edit dependencies and run dates, set comments on failed flows, and start/restart flows, including flows not in the actual batch.
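To give an idea of the kind of metadata extraction behind the Virtual Auditor, the flow-drawing application and the extract in 3a (our own programs are larger and not shown here), a minimal sketch using the DATA step metadata functions could look like the one below. The metadata functions and the "omsobj:Job" query are standard SAS; the naming rule and the output dataset name are just illustrative assumptions, and the session is assumed to already have the metadata server connection options (METASERVER=, METAPORT=, METAUSER=, METAPASS=) set.

/* Sketch: list all DI Studio jobs registered in metadata and flag names  */
/* that break an illustrative naming convention.                          */
data job_names;
   length uri name $ 256;
   call missing(uri, name);
   n = 1;
   rc = metadata_getnobj("omsobj:Job?@Id contains '.'", n, uri);
   do while (rc > 0);
      rc2 = metadata_getattr(uri, "Name", name);           /* job name     */
      violates_convention = not (upcase(name) =: "JOB_");  /* made-up rule */
      output;
      n + 1;
      rc = metadata_getnobj("omsobj:Job?@Id contains '.'", n, uri);
   end;
   keep name uri violates_convention;
run;

The same family of functions (metadata_getnasn etc.) can walk from a job to its source and target tables, which is exactly the information needed to derive job and flow dependencies.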
Most day-to-day operation is handled from this application. If a flow has failed, the normal process is to open the job list to see which job failed, then browse the log file and, depending on the error, either rerun it or correct it in DI Studio, redeploy the code and rerun the flow. When the flow is done, the SAS loop will detect it, so flows formerly excluded are taken up again, and the whole excluded branch runs to the end.
All of this depends on one thing only: the database must be updated whenever a job starts or ends. For that purpose an initstmt and a termstmt parameter are added to the command line parameters of the deployed jobs, and this is the only change to the standard SAS installation. This way, a job starts by calling a macro that logs the start time and log file name in the database, and ends by calling a macro that logs the end time and return code. Because of this implementation, flows can be run and rerun not only from the application, but also from SAS MC or Flow Manager, and still interact with the scheduler.
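Our macros are not shown here, but the mechanism can be sketched like this (macro names, libref and table are hypothetical, and the macros are assumed to be available to the batch session, e.g. through SASAUTOS):

/* Hypothetical additions to the deployed-job command line (one line in practice):
      -initstmt '%log_start(job=JOB_SALES_LOAD);'
      -termstmt '%log_end(job=JOB_SALES_LOAD, rc=&syscc);'                       */

%macro log_start(job=);
   /* register the start time and log file for this run */
   proc sql;
      insert into scheddb.job_runs (job_name, start_dt, log_file)
      values ("&job", %sysfunc(datetime()), "%sysfunc(getoption(log))");
   quit;
%mend log_start;

%macro log_end(job=, rc=);
   /* close the run with end time and return code */
   proc sql;
      update scheddb.job_runs
         set end_dt = %sysfunc(datetime()), return_code = &rc
         where job_name = "&job" and end_dt is missing;
   quit;
%mend log_end;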
Conclusion
It changed our work totally. We had fewer than 1,000 jobs in 2010, and it took about 100 hours weekly to control job quality, promote jobs and run the batch. Now we have at least four times as many jobs and a more complicated setup, yet deployment/promotion takes maybe 5 hours weekly, and the batch can run for weeks without any intervention except an hour or two to rerun failed jobs. So now we have time to keep everything nice and tidy, support the developers and do some development ourselves.
We wrote the SAS code ourselves (it took about a month), and we were lucky to have a colleague experienced in Visual Studio to write the applications. It has proved very robust: it has almost never failed in 8 years, the only problems being a few cases of network trouble. And it has survived migration through all SAS generations, from 9.3.1 on Windows to 9.4M5 on Linux, with only a few minor adjustments.