03-15-2016 06:27 AM
I'm writing an ETL program, which:
STEP 0. Defines lots of macro variables and macros to initialize the environment
STEP 1. Runs Query1 on Teradata and processes its result (e.g. sort, merge, ...); this step makes heavy use of the macros and macro variables initialized in step 0
STEP 2. Runs Query2 on Teradata and processes its result (e.g. sort, merge, ...); likewise built on the step 0 definitions
...
STEP 30. Runs Query30 on Teradata and processes its result (e.g. sort, merge, ...); likewise built on the step 0 definitions
STEP 31. Merges all results together
Up to now, all steps are sequential. However, I noticed that:
- step 0 is mandatory and must be run before anything
- steps 1-30 depend on step 0, and do not depend on each other
- step 31 depends on all previous steps
So, I'd like very much to first run step 0, and then run steps 1-30 in parallel, and finally run step 31.
How could I achieve that? I read about RSUBMIT (we have SAS/CONNECT licensed), but I don't know how to let the parallel steps use the macros and macro variables defined in step 0 (and I can't move step 0 into each of steps 1-30, because some things must be initialized only once).
Some useful info:
- We use SAS 9.4 M2
- All code is run remotely on a single SAS server (there is only one server, and no local session)
Example code of what I'd like to achieve:
/* STEP 0: environment initialization */
%let tms = %sysfunc(datetime(), B8601DT26.6);

%macro test(stepno);
    %put I am running step &stepno and the timestamp is &tms;
%mend test;

/* to be run in parallel */
/* STEP 1 */
%test(1);
/* STEP 2 */
%test(2);
/* ... */
/* STEP 30 */
%test(30);
I expect to see in the log:
I am running step 1 and the timestamp is 20160315T112545.225481
I am running step 2 and the timestamp is 20160315T112545.225481
I am running step 30 and the timestamp is 20160315T112545.225481
i.e. the same timestamp in each step, as it would be if I ran the steps sequentially.
Thank you so much for your help.
03-15-2016 06:48 AM
As far as I am aware, your program operates linearly, i.e. one step after the other. I'm not sure how multiple streams would work; it would require creating multiple instances of the PDV, having multiple read/write streams, etc. Why do you need to do this? Unless your data is in the terabytes, in which case you should be looking at grid computing or big data processing anyway, you won't gain anything from parallel running.
03-15-2016 06:49 AM
I'll take it that you don't use DI Studio for your ETL? This kind of logic is pretty much out of the box in DI Studio.
You can use %SYSLPUT to define/assign values to a remote SAS session:
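A minimal sketch of that pattern, assuming a single-machine setup where SASCMD="!sascmd" spawns child sessions on the same server (the session names task1/task2 are placeholders, and in practice you would generate all 30 with a macro loop). Note that macros compiled in the parent session are not visible in the child sessions, so each RSUBMIT block would need to %include or recompile them:

```sas
/* STEP 0: initialize once in the parent session */
%let tms = %sysfunc(datetime(), B8601DT26.6);

/* start the child sessions on the same server */
signon task1 sascmd="!sascmd";
signon task2 sascmd="!sascmd";

/* push the shared macro variable into each child session */
%syslput tms=&tms / remote=task1;
%syslput tms=&tms / remote=task2;

/* run the steps asynchronously; &tms resolves in each child session */
rsubmit task1 wait=no;
    %put I am running step 1 and the timestamp is &tms;
endrsubmit;

rsubmit task2 wait=no;
    %put I am running step 2 and the timestamp is &tms;
endrsubmit;

/* block until all parallel sessions finish, then run step 31 */
waitfor _all_ task1 task2;
signoff _all_;
```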
Another option could be to store the macro variable values in a permanent SAS data set, and then have them created as macro variables in each job. The benefit here is that you can define your job order in the tool of your choice (not in hard-coded SAS programs). If you have SAS Management Console, you can define job flows there, as an example. Also, the logs created will be smaller and easier to browse, and this IMO improves restart-ability for your job flow.
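A minimal sketch of the data-set approach (the library path and data set name are assumptions):

```sas
/* In the step 0 job: store the shared values once */
libname etl "/path/to/etl";          /* assumed permanent location */

data etl.parms;
    length tms $26;
    tms = left(put(datetime(), B8601DT26.6));
run;

/* At the top of each of jobs 1-30: turn them back into macro variables */
data _null_;
    set etl.parms;
    call symputx('tms', tms, 'G');   /* 'G' = global scope */
run;

%put I am running step 1 and the timestamp is &tms;
```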
03-15-2016 07:35 AM
Split all steps into different .sas files.
Include step 0 (with %include) in the other steps.
Have your scheduling system run steps 1 to 30 as batch jobs in parallel, and run step 31 when all those have finished successfully.
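The split could look like this (file and path names are hypothetical). One caveat: anything computed inside step0.sas, such as a timestamp, is re-evaluated in every job, so run-time values that must be identical across jobs should instead be written out once by a one-off step 0 job and read back in:

```sas
/* step1.sas - one of the 30 parallel batch jobs */
%include "/path/to/step0.sas";   /* compiles the shared macros and macro variables */

%test(1);

/* The scheduler runs step1.sas ... step30.sas concurrently,
   and starts step31.sas only after all of them end successfully. */
```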