Desktop productivity for business analysts and programmers

In SAS DI 4.2 , can a job be executed on multiples nodes (CPUs)?

Reply
N/A
Posts: 0

In SAS DI 4.2 , can a job be executed on multiples nodes (CPUs)?

Lets say I have a 4.2 DI job which is as follows --

SAS extract(40 million) --> SQL JOin with A --> Tempobject1 --> Lookup with B --> Tempobject2 --> SCD Loader --> Tempobject3 --> Table Loader --> ORA target

I'm running this job on a server which has 4 nodes(CPUs)

My question is is it possible in DI studio to run this job on all 4 CPUS simultaneosly. By this I mean the SAS extarct of 40 million will be split into say 10 million each and each CPU will read 10 million, do necessary lookups/sql joins etc and finally load in ORA , and that too in parallel.

This is a common feature in ETL tools like Datastage/Informatica, just wondering if DI studio has anything like this.

Appreciate any inputs on this!
Super User
Posts: 5,382

Re: In SAS DI 4.2 , can a job be executed on multiples nodes (CPUs)?

Well, this takes some explanation. First one to have to understand what SAS can do in parallel and how, and then how that affects your DI studio design.

I/O: SAS can do multithreaded reads if the data source supports that. This includes SAS engines SPDE and SPDS, and multiple external RDMS via SAS/ACCESS (including Oracle). This will occur "automatically", and is maintained by system, libname and data set options.

Processing: on a row level, SAS has a couple of procedures (including part of SQL) that run in parallel. If your source data to a step reside in Oracle, SAS can push down the SQL, and then the RDBMS is responsible for the parallelism.

On a higher level, when you have different chunks of data, the can be processed concurrently. This is done on application level using either MP CONNECT or Grid Manager (which in turn uses Platform LSF). Which to use of these depends on your license and your server infrastructure. To use this kind of parallelism you have to build your ETL flow accordingly. Take a look at DI Studio doc for guidance.

Conclusion: your flow is probably already partly parallelized. To make it use parallel steps you have look at your options setting, Oracle load settings. If the need urgent, you may have to rebuild the to split your data to able to make it run in parallel, but be "careful", not all type of processes will benefit from this.

/Linus
Data never sleeps
Ask a Question
Discussion stats
  • 1 reply
  • 120 views
  • 0 likes
  • 2 in conversation