rileyd
Quartz | Level 8

Is it possible or even a good idea to execute parallel processing within a parallel process? 

 

Here's my scenario:

I have 7 product tables that need to be created. All 7 product tables are built from 2 main source tables, and each product table also needs a mix of shared and product-specific lookup tables to create the final set of tables.

 

Option 1 (and my current design), with a rough code sketch after the list:

  1. 2 Main source tables are created (Premium & Loss tables). These programs are executed in parallel.
  2. Shared lookup table programs are executed in parallel. These shared tables are used by all 7 product table programs.
  3. 7 product table programs are executed in parallel. These programs include:
    1. Include statements to call various product-specific programs that create required lookup tables.
    2. Premium-specific table section of code.
    3. Loss-specific table section of code.
    4. Section of code that combines the premium and loss data, then joins on data from the shared and the various product-specific lookup tables.
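
If it helps to picture it, here is a minimal sketch of what the launch pattern in steps 1-3 might look like using SAS/CONNECT MP CONNECT (assuming SAS/CONNECT is licensed; the program names and folder path are placeholders, not my real ones):

options autosignon sascmd="!sascmd";   /* spawn extra SAS sessions on this machine */

/* Step 1: build the two main source tables in parallel */
rsubmit premium wait=no;
   %include "/path/to/programs/build_premium_source.sas";
endrsubmit;

rsubmit loss wait=no;
   %include "/path/to/programs/build_loss_source.sas";
endrsubmit;

waitfor _all_ premium loss;   /* block until both source tables exist */
signoff _all_;

/* Steps 2 and 3 repeat the pattern: one RSUBMIT ... WAIT=NO per shared-lookup
   program, a WAITFOR _ALL_, then one RSUBMIT per product table program.
   Each spawned session has its own WORK library, so the programs write their
   outputs to a shared permanent library. */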

The above works fine, but my thought is that much of what happens in the 7 product table programs could also be run in parallel. Essentially, would it be possible to kick off the 7 product table programs in parallel and, within each of those sessions, kick off another set of sessions running in parallel?

 

So step 3 above would look like the below, where the first set of programs running in parallel would be the 7 product programs, with each product program then kicking off another 3 programs in parallel (Lookup, Premium, & Loss) and a WAITFOR _ALL_ that brings the results from those 3 sessions back together in the Combine program.

 

  1. Product Line 1
    1. Lookup Program
    2. Premium Program
    3. Loss Program
    4. Combine Premium, Loss, and Lookup tables Program to create final Product Line 1 table
  2. Product Line 2
    1. Lookup Program
    2. Premium Program
    3. Loss Program
    4. Combine Premium, Loss, and Lookup Tables Program to create final Product Line 2 Table
  3. etc. for Product Lines 3-7

 

Something like:

Program1 spawns 7 parent parallel sessions, and each parent session spawns 3 child sessions.
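
A hedged sketch of that two-level idea, again assuming SAS/CONNECT MP CONNECT and that a spawned session can itself sign on further sessions (all program names and paths below are placeholders):

/* ---- driver program: spawn 7 parent sessions, one per product line ---- */
options autosignon sascmd="!sascmd";

%macro run_products;
   %do i = 1 %to 7;
      rsubmit prod&i wait=no;
         %include "/path/to/programs/product_line_&i..sas";
      endrsubmit;
   %end;
   waitfor _all_ prod1 prod2 prod3 prod4 prod5 prod6 prod7;
   signoff _all_;
%mend run_products;
%run_products

/* ---- inside each product_line_N.sas (a parent session) ----
   The parent is itself a SAS/CONNECT client, so it can spawn its own children.
   Note the head count: 7 parents x 3 children = 21 child sessions on top of
   the 7 parents and the driver, which is where resources become the question. */
options autosignon sascmd="!sascmd";

rsubmit lookup  wait=no;  %include "/path/to/programs/prod_n_lookup.sas";   endrsubmit;
rsubmit premium wait=no;  %include "/path/to/programs/prod_n_premium.sas";  endrsubmit;
rsubmit loss    wait=no;  %include "/path/to/programs/prod_n_loss.sas";     endrsubmit;

/* INHERITLIB=(work=cwork) on the RSUBMITs is one way to let the children
   write straight back to this parent session's WORK library */

waitfor _all_ lookup premium loss;   /* bring the 3 child results back together */
signoff _all_;

%include "/path/to/programs/prod_n_combine.sas";   /* combine runs in the parent */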

 

Does this make sense? Is this possible? Is this even a good idea? 

 

The reason I'm asking is that the tables in this process are large, so it takes a long time to create the final 7 tables even when running in parallel (option 1). My thought is that much of the product-line-specific code can be broken into smaller programs that can also run in parallel, so why not look for a way to run as much as I can in parallel in an effort to speed up the overall process? Basically, I rewrote an existing process and introduced option 1, and it has really reduced the time it takes to run this process. So I'm just wondering if I can take it to another level and reduce the run time even more!

 

Thanks!

-rileyd

 

 

ScottBass
Rhodochrosite | Level 12

Is it possible

 

Yes

 

or even a good idea to execute parallel processing within a parallel process? 

 

That depends.

 

Sometimes running too many parallel processes can actually increase the overall run time. It depends on the resources of your machine, especially the number of CPUs.

 

I assume you've got a "wrapper" or "driver" SAS program that launches other SAS programs in parallel based on your downstream dependencies?  If so, this may (or may not) help:

 

https://github.com/scottbass/SAS/blob/master/Macro/RunAll.sas

 

Unfortunately I don't have an example of the programs dataset (a dataset which defines the dependencies) on GitHub.  I've got it in my archive somewhere but don't have time to find it right now.

 

If you're running on Windows, this PowerShell script is very useful for running multi-threaded code. You'd have to create a simple structure, such as an array or hash, to define your dependencies. It supports a MaxThreads option but, instead of SAS's WAITFOR, will launch a new thread as existing threads finish.

 

http://www.get-blog.com/?p=189

 

 


SASKiwi
PROC Star

How long is a long time to run? Assuming you are running these in batch mode, does the job run within the time constraints of your batch window? If no, then exploring processing improvements appears worthwhile. If yes, why complicate processing further for no real benefit?

 

Also, don't necessarily assume that further parallelisation will always improve run times. You might find you run into network bandwidth constraints, for example. I've struck this issue myself; it is a common problem if the database server you are reading from is not in the same data centre as your SAS servers.

 

There may be other options for improving performance, such as tuning complex queries, trying different read buffer sizes for external databases, or using SAS dataset compression.
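
For illustration only (the libref, DSN, and table names below are made up), the read buffer and compression ideas might look like this:

libname srcdb odbc dsn=mydsn readbuff=10000;   /* fetch more rows per database round trip */

options compress=yes;                          /* store all new SAS datasets compressed */

data shared.premium_source (compress=yes);     /* or set compression per dataset */
   set srcdb.premium;
run;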

ScottBass
Rhodochrosite | Level 12

And for ODBC connections I have found DBSLICEPARM= to be very useful.
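
Something along these lines, with placeholder DSN and table names; DBSLICEPARM= turns on threaded reads from the database, here capped at 4 connections:

libname srcdb odbc dsn=mydsn dbsliceparm=(all,4);   /* threaded reads, up to 4 threads */

data work.loss_extract;
   set srcdb.loss_source;   /* the read is split across the threads */
run;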


Kurt_Bremser
Super User

Parallelization depends on the resources available. As soon as one critical resource (number of CPU cores, I/O or network bandwidth, available memory) is saturated or exhausted, further parallelization will only result in increased competition for that resource and a further decrease in overall performance.

I always teach my users that they are better off running jobs in succession instead of having multiple EG sessions running multiple processes, and I have introduced a limit on workspace server sessions per user. In the same vein, there is a limit on concurrent SAS batch jobs run from the scheduler.

So you should take a good look at your system's performance while you run your process in its current state. Do you see CPU cores idling? Do you see I/O wait states? How much memory is still available, and what is your network throughput compared to the available bandwidth (if there is a remote connection involved, e.g. to a DBMS)?
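
On the SAS side, a cheap starting point for that analysis is the FULLSTIMER option, which writes CPU time, memory, and I/O statistics for every step to the log:

options fullstimer;   /* per-step CPU, memory, and I/O figures in the SAS log */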

 

Maxim 30: Analyze, then optimize.

rileyd
Quartz | Level 8

Hey!

 

Just wanted to take a second and say thanks for all the responses. This definitely seems possible, but I need to do some additional research on the available system resources to determine whether this makes sense to actually do, or whether I'll end up slowing down the process because all the programs are fighting for the same limited resources.

 

Thanks!

-rileyd 
