🔒 This topic is solved and locked.
JohnJPS
Quartz | Level 8

Suppose a simple DI Studio workflow where you extract subsets of information from disparate data sources (using simple joins, each with one input).

 

SAS always runs the first join transformation first, while the second join, independent of the first, sits waiting.

 

Is there a setting in there anywhere that will allow independent operations to run simultaneously?

 

8 REPLIES
mkeintz
PROC Star

If all these joins essentially use the same "parent" datasets, then you might be able to use a DATA step to simultaneously create them all.  For instance

 

data boys girls;
   set sashelp.class;
   if sex='M' then output boys;
   else output girls;
run;

 

The savings derive from the fact that the parent dataset (sashelp.class) is read only once.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
JohnJPS
Quartz | Level 8

 

Hi @mkeintz

Thanks for the feedback.  Unfortunately the data is pulled from totally disparate data sources.  Sort of like:

 

(1) extract some claims data

 

(2) extract some policy data

 

(3) extract some customer data

 

(4) join (1) and (2) and (3)

 

It seems like (1), (2), and (3) could take place simultaneously, but DI Studio runs them one at a time.

 

Patrick
Opal | Level 21

@JohnJPS

DI Studio just generates SAS code (DATA steps and procs) which gets executed sequentially. SAS DIS doesn't provide process orchestration out of the box.

As @LinusH suggests, either implement the extracts as separate jobs and parallelize via scheduling, or use the DIS Loop transformation.

 

With the DIS Loop transformation: you would have to wrap your inner job's extract nodes in macros so that only one extract executes per iteration of the inner job. You could use a Conditional Start/End Transformation and check the value of a macro variable which you pass to the Loop transformation via a control table, as in the sketch below.
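A rough sketch of that wrapping in plain SAS, just to illustrate the pattern (the macro name run_one_extract, the macro variable extract_name, and the library/table names are hypothetical, not from DI Studio):

/* extract_name is assumed to be passed from the Loop control table */
%macro run_one_extract;
   %if %upcase(&extract_name) = CLAIMS %then %do;
      proc sql;
         create table work.claims_extract as
         select * from srclib.claims;      /* placeholder extract logic */
      quit;
   %end;
   %else %if %upcase(&extract_name) = POLICY %then %do;
      proc sql;
         create table work.policy_extract as
         select * from srclib.policy;
      quit;
   %end;
%mend run_one_extract;

%run_one_extract

In the real job the %if branches would be your existing extract nodes, and the Conditional Start/End transformations would play the role of the %if logic.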

 

I personally would go for multiple jobs and implement parallelization via scheduling. Having multiple simple jobs makes implementation, testing and operations simpler. If something falls over, it's easy to investigate, fix and re-run.

JohnJPS
Quartz | Level 8
Thanks; those two options are on the table, but I was hoping maybe I had just missed something in DI Studio. I figure we'll go down the road of multiple simultaneous jobs via our enterprise scheduler if necessary.
Patrick
Opal | Level 21

@JohnJPS

I believe if you implement the targets of your extracts as views and then use these views in the join, things would run simultaneously (since the extract logic only gets executed as part of the join).

SAS SQL can execute multi-threaded, so this should allow for some sort of parallelization. In my experience the SAS SQL optimizer doesn't always do a great job with complex joins, so whether such an approach would increase performance will depend on the actual extract and join logic as well as on the available disk I/O.
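A minimal sketch of the view-based approach, assuming hypothetical source libraries, tables and join keys (none of these names come from the thread):

proc sql;
   /* views: the extract logic is deferred until the join reads them */
   create view work.v_claims   as select * from clmlib.claims;
   create view work.v_policy   as select * from pollib.policy;
   create view work.v_customer as select * from custlib.customer;

   /* single step that consumes all three extracts */
   create table work.combined as
   select cl.*, po.policy_type, cu.customer_name
   from work.v_claims   cl
        inner join work.v_policy   po on cl.policy_id   = po.policy_id
        inner join work.v_customer cu on po.customer_id = cu.customer_id;
quit;

Whether this actually buys any parallelism depends on what the SQL planner does with the views and on how the underlying sources deliver data, as noted above.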

 

My design approach is to keep things in separate jobs. I'd rather go for multiple simple jobs, each with a single target and doing only one thing, than for something "elaborate". I always have operations in mind, so I try to keep dependencies in my jobs to a minimum. If a source table is not available or is corrupted, it's better to have a simple job fall over than a complex one, as this makes debugging and re-running things much, much easier.

LinusH
Tourmaline | Level 20
If you feel that the extract jobs take a considerable amount of time, you could either have them in separate jobs (tying the intermediate extracts to permanent tables), or use the built-in parallelization features of DI Studio (which use MP CONNECT or an LSF grid, depending on your licence).
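For reference, a bare-bones MP CONNECT sketch of running two extracts in parallel from plain SAS code; the session names, paths, libraries and tables below are assumptions for illustration, not code that DI Studio generates verbatim:

/* Parent session: spawn one child SAS session per extract,     */
/* run them asynchronously, then wait for both before the join. */
options sascmd="!sascmd";          /* lets SIGNON start sessions on this host */
libname stage "/data/stage";       /* permanent staging area for the extracts */

signon claims;
signon policy;

rsubmit claims wait=no;
   libname srclib "/data/source";
   libname stage  "/data/stage";
   data stage.claims_extract;  set srclib.claims;  run;
endrsubmit;

rsubmit policy wait=no;
   libname srclib "/data/source";
   libname stage  "/data/stage";
   data stage.policy_extract;  set srclib.policy;  run;
endrsubmit;

waitfor _all_ claims policy;       /* block until both extracts finish */
signoff _all_;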
Data never sleeps
RLigtenberg
SAS Employee

Just FYI

 

SAS Data Integration Studio 4.901 introduced the Fork transformations for parallelizing flows within a job:

http://support.sas.com/documentation/cdl/en/etlug/69395/HTML/default/viewer.htm#n0gxygc7jonw4mn1ucsi...

 

SAS Data Integration Studio provides a set of macros that are enabled via an option in the job properties:

http://support.sas.com/documentation/cdl/en/etlug/69395/HTML/default/viewer.htm#n1tbcw0rjm4bf1n1ozdy...

These macros are used by the Loop and the Fork transformations. They are available to the user.

JohnJPS
Quartz | Level 8
Thanks Robert - I was able to use "Fork" effectively on my problem.

