🔒 This topic is solved and locked.
JohnJPS
Quartz | Level 8

Suppose a simple DI Studio workflow where you extract subsets of information from disparate data sources (using simple joins, each with one input).

 

SAS always runs the first join transformation first, while the second join, independent of the first, sits waiting.

 

Is there a setting in there anywhere that will allow independent operations to run simultaneously?

 

8 REPLIES
mkeintz
PROC Star

If all these joins essentially use the same "parent" datasets, then you might be able to use a DATA step to simultaneously create them all.  For instance

 

data boys girls;
   set sashelp.class;
   if sex='M' then output boys;
   else output girls;
run;

 

The savings derive from the fact that the parent dataset (sashelp.class) is read only once.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
JohnJPS
Quartz | Level 8

 

Hi @mkeintz

Thanks for the feedback.  Unfortunately the data is pulled from totally disparate data sources.  Sort of like:

 

(1) extract some claims data

 

(2) extract some policy data

 

(3) extract some customer data

 

(4) join (1) and (2) and (3)

 

It seems like (1), (2), and (3) could take place simultaneously, but DI Studio runs them one at a time.

 

Patrick
Opal | Level 21

@JohnJPS

DI Studio just generates SAS code (DATA steps and procs) which gets executed sequentially. SAS DIS doesn't provide process orchestration out of the box.

As @LinusH suggests, either implement the extracts as separate jobs and parallelize via scheduling, or use the DIS Loop transformation.

 

With the DIS Loop transformation: you would have to wrap your inner job's extract nodes in macros so that only one extract executes per iteration of the inner job. You could use a Conditional Start/End Transformation and check the value of a macro variable which you pass to the Loop transformation via a control table, as in the sketch below.
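A rough sketch of that wrapping in plain SAS, just to illustrate the pattern (the macro name run_one_extract, the macro variable extract_name, and the library/table names are hypothetical, not from DI Studio):

/* extract_name is assumed to be passed from the Loop control table */
%macro run_one_extract;
   %if %upcase(&extract_name) = CLAIMS %then %do;
      proc sql;
         create table work.claims_extract as
         select * from srclib.claims;      /* placeholder extract logic */
      quit;
   %end;
   %else %if %upcase(&extract_name) = POLICY %then %do;
      proc sql;
         create table work.policy_extract as
         select * from srclib.policy;
      quit;
   %end;
%mend run_one_extract;

%run_one_extract

In the real job the %if branches would be your existing extract nodes, and the Conditional Start/End transformations would play the role of the %if logic.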

 

I personally would go for multiple jobs and implement parallelization via scheduling. Having multiple simple jobs makes implementation, testing and operations simpler. If something falls over, it's easy to investigate, fix and re-run.

JohnJPS
Quartz | Level 8
Thanks; those two options are on the table, but I was hoping maybe I had just missed something in DI Studio. I figure we'll go down the road of multiple simultaneous jobs via our enterprise scheduler if necessary.
Patrick
Opal | Level 21

@JohnJPS

I believe if you implement the targets of your extracts as views and then use these views in the join, things would run simultaneously (since the extract logic only gets executed as part of the join).

SAS SQL can execute multi-threaded, so this should allow for some sort of parallelization. In my experience the SAS SQL optimizer doesn't always do a great job with complex joins, so whether such an approach would increase performance will depend on the actual extract and join logic as well as on the available disk I/O.
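A minimal sketch of the view-based approach, assuming hypothetical source libraries, tables and join keys (none of these names come from the thread):

proc sql;
   /* views: the extract logic is deferred until the join reads them */
   create view work.v_claims   as select * from clmlib.claims;
   create view work.v_policy   as select * from pollib.policy;
   create view work.v_customer as select * from custlib.customer;

   /* single step that consumes all three extracts */
   create table work.combined as
   select cl.*, po.policy_type, cu.customer_name
   from work.v_claims   cl
        inner join work.v_policy   po on cl.policy_id   = po.policy_id
        inner join work.v_customer cu on po.customer_id = cu.customer_id;
quit;

Whether this actually buys any parallelism depends on what the SQL planner does with the views and on how the underlying sources deliver data, as noted above.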

 

My design approach is to keep things in separate jobs. I'd rather go for multiple simple jobs, each with a single target and doing only one thing, than for something "elaborate". I always have operations in mind, so I try to keep dependencies in my jobs to a minimum. If a source table is not available or is corrupted, it's better to have a simple job fall over than a complex one, as this makes debugging and re-running things much, much easier.

LinusH
Tourmaline | Level 20
If you feel that the extract jobs take a considerable amount of time, you could either have them in separate jobs (tying the intermediate extracts to permanent tables), or use the built-in parallelization features of DI Studio (which use MP CONNECT or an LSF grid, depending on your licence).
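For reference, a bare-bones MP CONNECT sketch of running two extracts in parallel from plain SAS code; the session names, paths, libraries and tables below are assumptions for illustration, not code that DI Studio generates verbatim:

/* Parent session: spawn one child SAS session per extract,     */
/* run them asynchronously, then wait for both before the join. */
options sascmd="!sascmd";          /* lets SIGNON start sessions on this host */
libname stage "/data/stage";       /* permanent staging area for the extracts */

signon claims;
signon policy;

rsubmit claims wait=no;
   libname srclib "/data/source";
   libname stage  "/data/stage";
   data stage.claims_extract;  set srclib.claims;  run;
endrsubmit;

rsubmit policy wait=no;
   libname srclib "/data/source";
   libname stage  "/data/stage";
   data stage.policy_extract;  set srclib.policy;  run;
endrsubmit;

waitfor _all_ claims policy;       /* block until both extracts finish */
signoff _all_;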
Data never sleeps
RLigtenberg
SAS Employee

Just FYI

 

SAS Data Integration Studio 4.901 introduced the Fork transformations for parallelizing flows within a job:

http://support.sas.com/documentation/cdl/en/etlug/69395/HTML/default/viewer.htm#n0gxygc7jonw4mn1ucsi...

 

SAS Data Integration Studio provides a set of macros that are enabled via an option in the job properties:

http://support.sas.com/documentation/cdl/en/etlug/69395/HTML/default/viewer.htm#n1tbcw0rjm4bf1n1ozdy...

These macros are used by the Loop and the Fork transformations. They are available to the user.

JohnJPS
Quartz | Level 8
Thanks Robert - I was able to use "Fork" effectively on my problem.

