Suppose a simple DI Studio workflow where you extract subsets of information from disparate data sources (using simple joins with one input).
SAS always runs one join transformation first, while the second join, which is independent of the first, sits waiting.
Is there a setting anywhere that will allow independent operations to run simultaneously?
Just FYI
SAS Data Integration Studio 4.901 introduced the Fork transformations for parallelizing flows within a job:
SAS Data Integration Studio provides a set of macros that are enabled via an option in the job properties:
These macros are used by the Loop and the Fork transformations. They are available to the user.
If all these joins essentially use the same "parent" datasets, then you might be able to use a single DATA step to create them all in one pass. For instance:
data boys girls;
   set sashelp.class;
   if sex='M' then output boys;
   else output girls;
run;
The savings derive from the fact that the parent dataset (sashelp.class) is read only once.
Hi @mkeintz
Thanks for the feedback. Unfortunately the data is pulled from totally disparate data sources. Sort of like:
(1) extract some claims data
(2) extract some policy data
(3) extract some customer data
(4) join (1) and (2) and (3)
It seems like (1), (2), and (3) could take place simultaneously, but DI Studio runs them one at a time.
DI Studio just generates SAS code (DATA steps and PROCs), which gets executed sequentially. SAS DIS doesn't provide process orchestration out of the box.
As @LinusH suggests, either implement the extracts as separate jobs and implement parallelization via scheduling, or use the DIS Loop transformation.
With the DIS Loop transformation: you would have to wrap the extract nodes of your inner job into macros so that only one extract executes per iteration of the inner job. You could use a Conditional Start/End transformation and check the value of a macro variable which you pass in to the Loop transformation via a control table.
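To make the "one extract per iteration" idea concrete, here is a minimal hand-written sketch. It is not the code DI Studio generates. The macro name run_one_extract, the parameter extract_id and the target table names are hypothetical, and sashelp.class only stands in for the real sources.

```sas
/* Hypothetical sketch: one inner job, one extract per iteration.          */
/* extract_id stands in for the value passed in from the control table.    */
/* sashelp.class is only a placeholder for the real source tables.         */
%macro run_one_extract(extract_id);
   %if &extract_id = 1 %then %do;
      /* placeholder for the claims extract */
      data work.claims;
         set sashelp.class(keep=name age);
      run;
   %end;
   %else %if &extract_id = 2 %then %do;
      /* placeholder for the policy extract */
      data work.policy;
         set sashelp.class(keep=name height);
      run;
   %end;
   %else %if &extract_id = 3 %then %do;
      /* placeholder for the customer extract */
      data work.customer;
         set sashelp.class(keep=name sex);
      run;
   %end;
   %else %put WARNING: Unknown extract_id=&extract_id;
%mend run_one_extract;

/* One iteration of the loop would then execute a single branch */
%run_one_extract(1)
```

Each call runs exactly one branch, which is what lets separate iterations be dispatched independently of each other.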
I personally would go for multiple jobs and implement parallelization via scheduling. Having multiple simple jobs makes implementation, testing and operation simpler. If something falls over then it's easy to investigate, fix and re-run.
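If no scheduler is at hand, a rough way to try out the "separate jobs in parallel" idea from a plain SAS session is SYSTASK with WAITFOR. This is only a sketch standing in for real scheduling: it assumes the session is allowed to run host commands (XCMD), that `sas` is the batch command on your server, and that the three deployed job files exist. All paths and task names below are hypothetical.

```sas
/* Hypothetical sketch: launch three deployed extract jobs in parallel.   */
/* Assumes XCMD is allowed and that 'sas' starts a batch session.         */
/* All paths and file names are made up for illustration.                 */
systask command "sas -sysin /jobs/extract_claims.sas -log /logs/extract_claims.log"
        taskname=claims status=rc_claims;
systask command "sas -sysin /jobs/extract_policy.sas -log /logs/extract_policy.log"
        taskname=policy status=rc_policy;
systask command "sas -sysin /jobs/extract_customer.sas -log /logs/extract_customer.log"
        taskname=customer status=rc_customer;

/* Block until all three extracts have finished */
waitfor _all_ claims policy customer;

/* Check the return codes before running the join job */
%put NOTE: rc_claims=&rc_claims rc_policy=&rc_policy rc_customer=&rc_customer;
```

SYSTASK starts each command asynchronously by default, so the three extracts run side by side and the join job only needs to start once WAITFOR returns.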
I believe that if you implement the targets of your extracts as views and then use these views in the join, things would run simultaneously (as the extract logic then only gets executed as part of the join).
SAS SQL can execute multi-threaded, so this should allow for some sort of parallelization. In my experience the SAS SQL optimizer doesn't always do a great job with complex joins, so whether such an approach increases performance will depend on the actual extract and join logic as well as on the available disk I/O.
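A minimal sketch of the view-based approach follows. The tiny src_ tables, their columns and the filter are made up so that the example runs on its own; in a real job the views would sit on top of the actual source libraries. Whether the extracts really execute concurrently this way is not verified here.

```sas
/* Hypothetical stand-ins for the three disparate sources */
data work.src_claims;
   input policy_id claim_amt;
   datalines;
101 250
102 900
;
run;

data work.src_policy;
   input policy_id customer_id;
   datalines;
101 1
102 2
;
run;

data work.src_customer;
   input customer_id customer_name $;
   datalines;
1 Alice
2 Bob
;
run;

proc sql;
   /* The extracts are defined as views, so nothing is read yet */
   create view work.v_claims   as select policy_id, claim_amt
                                  from work.src_claims where claim_amt > 0;
   create view work.v_policy   as select policy_id, customer_id
                                  from work.src_policy;
   create view work.v_customer as select customer_id, customer_name
                                  from work.src_customer;

   /* The extract logic only executes here, as part of the join */
   create table work.joined as
   select c.policy_id, cu.customer_name, c.claim_amt
   from work.v_claims as c
        inner join work.v_policy   as p  on c.policy_id   = p.policy_id
        inner join work.v_customer as cu on p.customer_id = cu.customer_id;
quit;
```

Because the views defer the reads, the optimizer sees the extracts and the join as one query, which is also why performance depends so heavily on how well it plans that query.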
My design approach is to keep things in separate jobs. I would rather go for multiple simple jobs, each with a single target and doing only one thing, than for something "elaborate". I always have operations in mind, so I try to keep dependencies in my jobs to a minimum. If a source table is unavailable or corrupted, then it's better to have a simple job fall over than a complex one, as this makes debugging and re-running things much easier.
