Is it possible or even a good idea to execute parallel processing within a parallel process?
Here's my scenario:
I have 7 product tables that need to be created. All 7 are built from 2 main source tables, and each product table also needs a mix of unique and shared lookup tables to create the final set of tables.
Option 1 (and my current design):
- 2 main source tables are created (Premium & Loss tables). These programs are executed in parallel.
- Shared lookup table programs are executed in parallel. These shared tables are used by all 7 product table programs.
- 7 product table programs are executed in parallel. Each program includes:
  - %INCLUDE statements that call various product-specific programs to create required lookup tables.
  - A Premium-specific table section of code.
  - A Loss-specific table section of code.
  - A section of code that combines the premium and loss data, then joins on data from the shared and product-specific lookup tables.
The above works fine, but my thought is that much of what happens in the 7 product table programs could also be run in parallel. Essentially, would it be possible to kick off the 7 product table programs in parallel and, within each of those sessions, kick off another set of sessions running in parallel?
So step 3 above would look like the structure below. The first set of programs running in parallel would be the 7 product programs; each product program would then kick off another 3 programs in parallel (Lookup, Premium, & Loss), with a WAITFOR _ALL_ that brings the results from those 3 sessions back together in the Combine program.
- Product Line 1
  - Lookup program
  - Premium program
  - Loss program
  - Combine program that joins the Premium, Loss, and Lookup tables to create the final Product Line 1 table
- Product Line 2
  - Lookup program
  - Premium program
  - Loss program
  - Combine program that joins the Premium, Loss, and Lookup tables to create the final Product Line 2 table
- etc. for Product Lines 3-7
Something like:
Program1 spawns 7 parent sessions in parallel, and each parent session spawns 3 child sessions of its own.
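In SAS terms, one parent session could look something like the sketch below. This is a minimal MP CONNECT sketch, assuming SAS/CONNECT is licensed; the session names, paths, and the `pwork` libref are all hypothetical placeholders:

```
/* Sketch of ONE product-line parent session (assumes SAS/CONNECT).   */
/* Session names and include paths are hypothetical.                  */
options autosignon sascmd="!sascmd";

/* Kick off the three child sessions without waiting */
rsubmit lookup wait=no inheritlib=(work=pwork);
   %include "/path/to/lookup_program.sas";
endrsubmit;

rsubmit premium wait=no inheritlib=(work=pwork);
   %include "/path/to/premium_program.sas";
endrsubmit;

rsubmit loss wait=no inheritlib=(work=pwork);
   %include "/path/to/loss_program.sas";
endrsubmit;

/* Block until all three children finish, then combine their output */
waitfor _all_ lookup premium loss;
signoff _all_;

%include "/path/to/combine_program.sas";
```

If the top-level driver launches the 7 product programs the same way, each of those 7 sessions would run this pattern, giving up to 7 × 3 = 21 concurrent child sessions plus the 7 parents, which is why the resource questions raised below matter.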
Does this make sense? Is this possible? Is this even a good idea?
The reason I'm asking is that the tables in this process are large, so it takes a long time to create the final 7 tables even when running in parallel (option 1). My thought is that much of the product-line-specific work can be broken into smaller programs that can also run in parallel, so why not run as much as I can in parallel to speed up the overall process? Basically, I rewrote an existing process and introduced option 1, and it really reduced the run time. So I'm just wondering if I can take it to another level and reduce the run time even more!
Thanks!
-rileyd
Accepted Solutions
Hey!
Just wanted to take a second and say thanks for all the responses. This definitely seems possible but I need to do some additional research regarding the system resources available to determine whether this makes sense to actually do or if I'll end up slowing down the process due to all the programs fighting for the same limited resources.
Thanks!
-rileyd
> Is it possible

Yes.

> or even a good idea to execute parallel processing within a parallel process?

That depends.
Sometimes running too many parallel processes can actually increase the overall run time. It depends on the resources of your machine, especially the number of CPUs.
I assume you've got a "wrapper" or "driver" SAS program that launches other SAS programs in parallel based on your downstream dependencies? If so, this may (or may not) help:
https://github.com/scottbass/SAS/blob/master/Macro/RunAll.sas
Unfortunately I don't have an example of the programs dataset (a dataset which defines the dependencies) on GitHub. I've got it in my archive somewhere but don't have time to find it right now.
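For a flavor of what such a driver does, here is a minimal sketch using SYSTASK (not the RunAll.sas macro itself; the program paths, log paths, and task names are made up, and it assumes batch SAS on a host where XCMD is allowed):

```
/* Hypothetical driver sketch: launch product programs in parallel  */
/* OS sessions via SYSTASK, then wait for all of them to finish.    */
systask command "sas /path/to/product1.sas -log /path/to/product1.log"
        taskname=prod1 status=rc1;
systask command "sas /path/to/product2.sas -log /path/to/product2.log"
        taskname=prod2 status=rc2;
/* ... repeat for products 3-7 ... */

waitfor _all_ prod1 prod2;   /* list every task name here */

%put Return codes: prod1=&rc1 prod2=&rc2;
```

A real driver like RunAll.sas adds the important part: a dataset of dependencies so that a program is only launched once everything it depends on has completed.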
If you're running on Windows, this PowerShell script is very useful for running multi-threaded code. You'd have to create a simple structure, such as an array or hash, to define your dependencies. It supports a MaxThreads option and, instead of SAS's WAITFOR, will launch a new thread as existing threads finish.
http://www.get-blog.com/?p=189
Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
How long is a long time to run? Assuming you are running these in batch mode, does the job run within the time constraints of your batch window? If no, then exploring processing improvements appears worthwhile. If yes, why complicate processing further for no real benefit?
Also, don't assume that further parallelisation will always improve run times. You might run into network bandwidth constraints, for example. I've struck this issue myself; it is a common problem when the database server you are reading from is not in the same data centre as your SAS servers.
There may be other options for improving performance, such as tuning complex queries, trying different read buffer sizes for external databases, or using SAS dataset compression.
And for ODBC connections I have found DBSLICEPARM to be very useful.
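For context, DBSLICEPARM asks SAS/ACCESS to split a DBMS read across multiple threaded connections. A minimal sketch (the DSN, libref, and table names are made up):

```
/* Hypothetical sketch: read an ODBC table with up to 4 parallel */
/* connections via threaded reads (DSN and table names made up). */
libname mydb odbc dsn=MyDsn dbsliceparm=(all, 4);

data work.premium;
   set mydb.premium_detail;   /* the read is sliced across threads */
run;
```

DBSLICEPARM can also be set as a dataset option or a system option, so it can be applied to a single heavy extraction rather than every table in the library.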
Parallelization depends on the resources available. As soon as one critical resource (number of CPU cores, I/O or network bandwidth, available memory) is saturated or exhausted, further parallelization will only result in increased competition for that resource and a further decrease in overall performance.
I always teach my users that they are better off running jobs in succession instead of having multiple EG sessions running multiple processes, and I have introduced a limit on workspace server sessions per user. In the same vein, there is a limit on concurrent SAS batch jobs run from the scheduler.
So you should take a good look at your system's performance while you run your process in its current state. Do you see CPU cores idling? Do you see I/O wait states? How much memory is still available, and what is your network throughput compared to the available bandwidth (if there is a remote connection involved, e.g. to a DBMS)?
Maxim 30: Analyze, then optimize.