Hi all,
I have created a process job which contains 15 data jobs, and it takes about 10 hours to run against 1.6 million records. Is it possible to run the data jobs in parallel so that the run time can be reduced a bit?
Thanks and Regards
Joydip Ghosh
Feature-wise, yes. How to do it depends on the application.
To be able to help you with the how, please elaborate on how your job flow is built. How many separate data streams do you have, what are the data/job dependencies, and what hardware resources are available to you?
Hi Linus,
The job has 15 data flows. All of them depend on the 1st data flow node, which creates one value, but the other 14 are independent of each other. Right now they run sequentially: after the 1st node completes, the 2nd starts; after the 2nd completes, the 3rd starts; and so on. If we could run them in parallel, I think it would be much faster. How can we do that? Do we need the Fork node or the Parallel Iteration node?
Thanks and Regards
Joydip
If I understand you correctly and there are 14 different jobs, then I would build the job dependencies using the Schedule Manager plugin in Management Console.
If there is one job but with different sets of input data, I would solve it using the Loop transformation in DI Studio. Parallelism is managed by setting its parameters accordingly.
Joydip,
You can run the first data job then link it to a Fork Node which would contain all of the other 14 data jobs. The Fork node will allow the 14 jobs to run in parallel once the first node completes.
See attached screenshot.
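If it helps to picture what the Fork node is doing, the same run-one-then-fan-out pattern can be sketched outside of DataFlux. This is only a rough Python illustration (the job functions and names are placeholders, not your actual data jobs): the first node runs to completion, then the remaining 14 are dispatched to run concurrently.

# Rough illustration only -- not DataFlux code; job functions are placeholders.
from concurrent.futures import ThreadPoolExecutor

def run_first_job():
    # Produces the value the other 14 jobs depend on (placeholder).
    return "shared_value"

def run_data_job(job_id, shared_value):
    # Stands in for one of the 14 independent data jobs (placeholder).
    print(f"job {job_id} running with {shared_value}")

shared_value = run_first_job()          # step 1 must finish first

# Fan out: the 14 independent jobs run at the same time,
# which is what the Fork node does inside the process job.
with ThreadPoolExecutor(max_workers=14) as pool:
    for job_id in range(2, 16):
        pool.submit(run_data_job, job_id, shared_value)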
-shawn
Hi Linus, Shawn,
Thanks for the update about using Fork, but we are still facing a couple of issues:
a) Sometimes we run out of memory. Is there any way to control how much memory is allocated?
b) All of our data jobs write their exceptions to the same table, so the server refuses the connection when two threads try to update that table at the same time, and the job fails. Is there an alternative way to handle this?
All suggestions are welcome.
Thanks and Regards
Joydip
Joydip,
Specifically, which node is failing due to memory issues? There are node-specific memory tweaks you can make. Also, how much physical memory is available on the computer that is running this job? And what type of table are you writing the exceptions to? It sounds like it may be a table or database that does not allow simultaneous connections.
-shawn
Hi Shawn,
Thanks for your reply. My system has 2 GB of RAM, and the database is a SQL Server database. We write the exceptions to the exception table ourselves, using macros to connect to the database. We use an Expression node in which we open the connection, write the exception records, and close the connection, and it is this Expression node that is failing due to the memory issue.
Thanks and Regards.
Joydip
Joydip,
SQL Server uses pessimistic concurrency by default; this is stated in the official MS documentation. Have you thought about using the Data Target (Insert) node and passing the macros into it? Opening and closing the database connection through an expression may be another reason the job is failing. You can also control commit intervals in the Data Target (Insert) node, which could help performance.
Pessimistic Concurrency
Default behavior: acquire locks to block access to data that another process is using.
Optimistic Concurrency
Assumes that there are sufficiently few conflicting data modification operations in the system that any single transaction is unlikely to modify data that another transaction is modifying.
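To show what the commit-interval idea amounts to outside of the Data Target (Insert) node, here is a rough Python sketch; pyodbc, the table name, and the column names are just assumptions for illustration, not your actual setup. The idea: one connection per job kept open for the run, inserts committed in batches rather than per row, and a short retry if another job happens to hold a lock on the exception table.

# Rough illustration only -- not DataFlux code; pyodbc and table/column names are assumptions.
import time
import pyodbc

COMMIT_INTERVAL = 500   # rows per commit, like the commit interval on Data Target (Insert)

def insert_batch(cur, batch, retries=3):
    # Retry briefly if another job holds a lock on the exception table.
    for attempt in range(retries):
        try:
            cur.executemany(
                "INSERT INTO exception_log (job_name, message) VALUES (?, ?)", batch)
            return
        except pyodbc.Error:
            if attempt == retries - 1:
                raise
            time.sleep(1)                    # back off and try again

def write_exceptions(conn_str, rows):
    conn = pyodbc.connect(conn_str)          # one connection per job, kept open
    cur = conn.cursor()
    batch = []
    for row in rows:                         # rows: (job_name, message) tuples
        batch.append(row)
        if len(batch) >= COMMIT_INTERVAL:
            insert_batch(cur, batch)
            conn.commit()                    # commit in chunks, not per row
            batch = []
    if batch:
        insert_batch(cur, batch)
        conn.commit()
    conn.close()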
Hope this helps,
-shawn
So, you work in DataFlux; that information could have been useful... 😉