rileyd
Quartz | Level 8

Is it possible or even a good idea to execute parallel processing within a parallel process? 

 

Here's my scenario:

I have 7 product tables that need to be created. All 7 product tables are built from 2 main source tables, and each product table also needs a mix of shared and product-specific lookup tables to create the final set of tables.

 

Option 1 (and my current design), with a rough code sketch after the list:

  1. 2 Main source tables are created (Premium & Loss tables). These programs are executed in parallel.
  2. Shared lookup table programs are executed in parallel. These shared tables are used by all 7 product table programs.
  3. 7 product table programs are executed in parallel. These programs include:
    1. Include statements to call various product-specific programs that create required lookup tables.
    2. Premium-specific table section of code.
    3. Loss-specific table section of code.
    4. Section of code that combines the premium and loss data, then joins on data from the shared and the various product-specific lookup tables.
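
If it helps to picture it, here is a minimal sketch of what the launch pattern in steps 1-3 might look like using SAS/CONNECT MP CONNECT (assuming SAS/CONNECT is licensed; the program names and folder path are placeholders, not my real ones):

options autosignon sascmd="!sascmd";   /* spawn extra SAS sessions on this machine */

/* Step 1: build the two main source tables in parallel */
rsubmit premium wait=no;
   %include "/path/to/programs/build_premium_source.sas";
endrsubmit;

rsubmit loss wait=no;
   %include "/path/to/programs/build_loss_source.sas";
endrsubmit;

waitfor _all_ premium loss;   /* block until both source tables exist */
signoff _all_;

/* Steps 2 and 3 repeat the pattern: one RSUBMIT ... WAIT=NO per shared-lookup
   program, a WAITFOR _ALL_, then one RSUBMIT per product table program.
   Each spawned session has its own WORK library, so the programs write their
   outputs to a shared permanent library. */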

The above works fine, but my thought is that much of what happens in the 7 product table programs could also be run in parallel. Essentially, would it be possible to kick off the 7 product table programs in parallel and, within each of those sessions, kick off another set of sessions running in parallel?

 

So step 3 above would look like the below, where the first set of programs running in parallel would be the 7 product programs, with each product program then kicking off another 3 programs in parallel (Lookup, Premium, & Loss) and a WAITFOR _ALL_ that brings the results from those 3 sessions back together in the Combine program.

 

  1. Product Line 1
    1. Lookup Program
    2. Premium Program
    3. Loss Program
    4. Combine Premium, Loss, and Lookup tables Program to create final Product Line 1 table
  2. Product Line 2
    1. Lookup Program
    2. Premium Program
    3. Loss Program
    4. Combine Premium, Loss, and Lookup Tables Program to create final Product Line 2 Table
  3. etc. for Product Lines 3-7

 

Something like:

Program1 spawns 7 parent parallel sessions, and each parent session spawns 3 child sessions.
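
A hedged sketch of that two-level idea, again assuming SAS/CONNECT MP CONNECT and that a spawned session can itself sign on further sessions (all program names and paths below are placeholders):

/* ---- driver program: spawn 7 parent sessions, one per product line ---- */
options autosignon sascmd="!sascmd";

%macro run_products;
   %do i = 1 %to 7;
      rsubmit prod&i wait=no;
         %include "/path/to/programs/product_line_&i..sas";
      endrsubmit;
   %end;
   waitfor _all_ prod1 prod2 prod3 prod4 prod5 prod6 prod7;
   signoff _all_;
%mend run_products;
%run_products

/* ---- inside each product_line_N.sas (a parent session) ----
   The parent is itself a SAS/CONNECT client, so it can spawn its own children.
   Note the head count: 7 parents x 3 children = 21 child sessions on top of
   the 7 parents and the driver, which is where resources become the question. */
options autosignon sascmd="!sascmd";

rsubmit lookup  wait=no;  %include "/path/to/programs/prod_n_lookup.sas";   endrsubmit;
rsubmit premium wait=no;  %include "/path/to/programs/prod_n_premium.sas";  endrsubmit;
rsubmit loss    wait=no;  %include "/path/to/programs/prod_n_loss.sas";     endrsubmit;

/* INHERITLIB=(work=cwork) on the RSUBMITs is one way to let the children
   write straight back to this parent session's WORK library */

waitfor _all_ lookup premium loss;   /* bring the 3 child results back together */
signoff _all_;

%include "/path/to/programs/prod_n_combine.sas";   /* combine runs in the parent */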

 

Does this make sense? Is this possible? Is this even a good idea? 

 

The reason I'm asking is that the tables in this process are large, so it takes a long time to create the final 7 tables even when running in parallel (option 1). My thought is that much of the product-line-specific code can be broken into smaller programs that can also run in parallel, so why not look for a way to run as much as I can in parallel in an effort to speed up the overall process? Basically, I rewrote an existing process and introduced option 1, and it has really reduced the time it takes to run this process. So I'm just wondering if I can take it to another level and reduce the run time even more!

 

Thanks!

-rileyd

 

 

ScottBass
Rhodochrosite | Level 12

Is it possible

 

Yes

 

or even a good idea to execute parallel processing within a parallel process? 

 

That depends.

 

Sometimes running too many parallel processes can actually increase the overall run time. It depends on the resources of your machine, especially the number of CPUs.

 

I assume you've got a "wrapper" or "driver" SAS program that launches other SAS programs in parallel based on your downstream dependencies?  If so, this may (or may not) help:

 

https://github.com/scottbass/SAS/blob/master/Macro/RunAll.sas

 

Unfortunately I don't have an example of the programs dataset (a dataset which defines the dependencies) on GitHub.  I've got it in my archive somewhere but don't have time to find it right now.

 

If you're running on Windows, this PowerShell script is very useful for running multi-threaded code. You'd have to create a simple structure, such as an array or hash, to define your dependencies. It supports a MaxThreads option but, instead of SAS's WAITFOR, will launch a new thread as existing threads finish.

 

http://www.get-blog.com/?p=189

 

 


SASKiwi
PROC Star

How long is a long time to run? Assuming you are running these in batch mode, does the job run within the time constraints of your batch window? If no, then exploring processing improvements appears worthwhile. If yes, why complicate processing further for no real benefit?

 

Also, don't necessarily assume that further parallelisation will always improve run times. You might find you run into network bandwidth constraints, for example. I've struck this issue myself; it is a common problem if the database server you are reading from is not in the same data centre as your SAS servers.

 

There may be other options for improving performance, such as tuning complex queries, trying different read buffer sizes for external databases, or using SAS dataset compression.
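
For illustration only (the libref, DSN, and table names below are made up), the read buffer and compression ideas might look like this:

libname srcdb odbc dsn=mydsn readbuff=10000;   /* fetch more rows per database round trip */

options compress=yes;                          /* store all new SAS datasets compressed */

data shared.premium_source (compress=yes);     /* or set compression per dataset */
   set srcdb.premium;
run;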

ScottBass
Rhodochrosite | Level 12

And for ODBC connections I have found DBSLICEPARM= to be very useful.
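
Something along these lines, with placeholder DSN and table names; DBSLICEPARM= turns on threaded reads from the database, here capped at 4 connections:

libname srcdb odbc dsn=mydsn dbsliceparm=(all,4);   /* threaded reads, up to 4 threads */

data work.loss_extract;
   set srcdb.loss_source;   /* the read is split across the threads */
run;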


Kurt_Bremser
Super User

Parallelization depends on the resources available. As soon as one critical resource (number of CPU cores, I/O or network bandwidth, available memory) is saturated or exhausted, further parallelization will only result in increased competition for that resource and a further decrease in overall performance.

I always teach my users that they are better off running jobs in succession instead of having multiple EG sessions running multiple processes, and I have introduced a limit on workspace server sessions per user. In the same vein, there is a limit on concurrent SAS batch jobs run from the scheduler.

So you should take a good look at your system's performance while you run your process in its current state. Do you see CPU cores idling? Do you see I/O wait states? How much memory is still available, and what is your network throughput compared to the available bandwidth (if there is a remote connection involved, e.g. to a DBMS)?
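
On the SAS side, a cheap starting point for that analysis is the FULLSTIMER option, which writes CPU time, memory, and I/O statistics for every step to the log:

options fullstimer;   /* per-step CPU, memory, and I/O figures in the SAS log */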

 

Maxim 30: Analyze, then optimize.

rileyd
Quartz | Level 8

Hey!

 

Just wanted to take a second and say thanks for all the responses. This definitely seems possible, but I need to do some additional research on the available system resources to determine whether this makes sense to actually do, or whether I'll end up slowing down the process because all the programs are fighting for the same limited resources.

 

Thanks!

-rileyd 
