11-28-2017 08:42 PM
...that doesn't mean my expectations are correct!
Do this in an EG project:
Process Flow A
data work.class;set sashelp.class;run;
data work.shoes;set sashelp.shoes;run;
In File --> Project Properties --> Code Submission, tick "Allow parallel execution on the same server".
Run the project. Because these datasets are so small, it may not be evident that the two executions ran in parallel.
But the 1400+ lines of clutter in the log should give some indication (would option nosource be good here?)
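To answer my own parenthetical, a sketch of how each parallel program could quiet its log (just an illustration; NONOTES also hides messages you may want to keep):

```sas
/* Suppress source lines and notes to cut down the log clutter */
/* produced by the extra workspace sessions; restore afterwards. */
options nosource nonotes;

data work.class;
   set sashelp.class;
run;

options source notes;
```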
Also, EG doesn't try to display the datasets in a dataviewer window, another indicator that it ran on a separate workspace server.
This is a trivial example; in my "real code" I'm running about 30 program entries to populate permanent SQL Server tables, where there are no dependencies between the program entries. The parallel processing really improves processing time.
But I then want to run additional, "normal" processing after the parallel processing.
So, in the same EG project:
New Process Flow
Process Flow B
data work.stocks;set sashelp.stocks;run;
In the Program properties --> Code Submission, tick the "Customize code submission options", and leave both checkboxes unticked.
So this program entry runs on the "default" workspace server.
If you run this, no clutter in the log, and work.stocks displays in the dataviewer window.
Now from the dataviewer window, Filter and Sort, All Variables, Filter Stock = IBM
This code fails. It appears to be running on a new workspace server in accordance with the project properties, so work.stocks does not exist on that server.
1) Should I expect Filter and Sort to obey the program properties, since the source dataset was created on the default workspace server, not via parallel processing?
2) What is the best way to configure Process Flow A to run all contained programs in parallel, but all other Process Flows sequentially? (Please don't say I have to RMB the 30 program entries in Process Flow A and configure for parallel execution).
I'll wait for feedback on my expectations before I create a SASWare Ballot entry to enhance parallel processing configuration to the Process Flow as well as Project and individual program entry level.
11-29-2017 07:54 AM
EG has the ability to manage multiple connections to a single "logical" SAS server (example, "SASApp"). It does this behind the scenes with the built-in Data Explorer feature, which submits multiple jobs to gather descriptive stats for a single data set. And then it also does this a bit more explicitly with the "parallel processing" flag for SAS programs.
However, there are pitfalls. WORK data sets are one issue, since each SAS workspace has its own WORK/temp space. Your jobs would need to be created to work with a common shared work library -- different than the built-in WORK -- to share the transient data generated among the different programs.
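As a sketch of that shared-library idea (the path and libref are illustrative, not a standard), each parallel program would point a common libref at the same physical location rather than relying on its session-private WORK:

```sas
/* Each parallel workspace session gets its own WORK, so transient   */
/* data must go through a common permanent library instead.          */
libname shrwork "/sasdata/shared_work";   /* hypothetical shared path */

data shrwork.class;      /* now visible to the other parallel sessions */
   set sashelp.class;
run;
```

You would also need to manage cleanup yourself, since nothing deletes shrwork members when the sessions end the way WORK is deleted.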
Another "gotcha" is something your admin might notice before you do: multiple SAS workspaces in use by a single EG session. If entire teams work this way, that could be a lot of workspaces -- the management of which the admin can't control centrally.
That's why SAS Enterprise Guide is also designed to work well with SAS Grid Computing, which provides the benefits of parallel processing in a way that a SAS admin can manage as well. If you use SAS Grid functions/macros within your SAS program, you can tell the SAS Grid which sections of code can run at the same time. SAS Enterprise Guide even provides a code analyzer that can annotate your program with these directives, making it easier to build a job that leverages the grid for parallel processing.
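For readers who haven't seen those directives, a sketch of the grid-enabled pattern, assuming SAS Grid Manager and SAS/CONNECT are licensed (server name "SASApp" and the task code are illustrative):

```sas
/* Tell SAS/CONNECT to launch remote sessions via the grid. */
%let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));

signon task1;
rsubmit task1 wait=no;        /* runs asynchronously on the grid */
   proc means data=sashelp.shoes; run;
endrsubmit;

signon task2;
rsubmit task2 wait=no;
   proc means data=sashelp.class; run;
endrsubmit;

waitfor _all_ task1 task2;    /* block until both sections finish */
signoff _all_;
```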
If you're looking for efficiency, I think that's the better option. If you're looking for simplicity -- but with parallel processing -- then you'll probably need to limit your scenarios to those that don't share WORK data and don't have too many parallel branches.
BTW, before parallel processing was an option in EG, many users simply opened multiple EG sessions and worked that way. Too many of those can get confusing, but that method still works too.
11-30-2017 08:19 PM
EG has the ability to manage multiple connections to a single "logical" SAS server (example, "SASApp"). It does this behind the scenes with the built-in Data Explorer feature, which submits multiple jobs to gather descriptive stats for a single data set.
I'm not quite sure what this means. Would multiple connections to a single logical SAS server run synchronously or asynchronously (in parallel)? In any case, for my processing I'm not interested in descriptive stats for a single data set, unless those stats somehow add intelligence so that downstream EG tasks run in the correct environment.
And then it also does this a bit more explicitly with the "parallel processing" flag for SAS programs.
This appears to do what I want. When I run multiple programs in parallel, it's clear from the "green boxes" and the start/end times that the code runs asynchronously. I assume this is spawning multiple SAS processes behind the scenes, analogous to MP CONNECT but without all the explicit setup and process management (e.g., explicit WAITFOR statements). So it's good that EG makes this easier than setting up connect scripts, which I don't control anyway.
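For comparison, here's a sketch of the manual MP CONNECT equivalent of what I assume EG is doing behind the scenes (the SASCMD value varies by installation):

```sas
/* Manual MP CONNECT: spawn two child SAS sessions on the same       */
/* machine and run them in parallel; EG's parallel flag hides this.  */
options sascmd="!sascmd";   /* launch child sessions like the parent */

signon job1;
rsubmit job1 wait=no;
   data work.class; set sashelp.class; run;   /* child's own WORK */
endrsubmit;

signon job2;
rsubmit job2 wait=no;
   data work.shoes; set sashelp.shoes; run;
endrsubmit;

waitfor _all_ job1 job2;
signoff _all_;
```

Note that each child writes to its own WORK, which is exactly the dataset-visibility issue described above.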
Your jobs would need to be created to work with a common shared work library -- different than the built-in WORK -- to share the transient data generated among the different programs.
In my "real code", this is the case; the code that runs in parallel is doing explicit passthrough to SQL Server, with both source and target tables residing in SQL Server.
My issue is that:
1) For a program entry that has overridden the project default of parallel processing,
2) So it runs in the "current" workspace server,
3) And creates a work dataset in the current workspace server, then
4) I would expect a task based on that dataset, such as Filter and Sort, to be smart enough to run in the current workspace server instead of under the project default of parallel processing. IMO that would be the user-friendly thing to do: recognize that the work dataset lives on the current server, and run downstream code that uses it as a source table on that same server. Of course, others may disagree.
That's why SAS Enterprise Guide is also designed to work well with SAS Grid Computing, ... If you're looking for efficiency, I think that's the better option.
We don't have SAS Grid Computing installed. I assume SAS Grid Computing significantly increases license costs?
Do those SAS Grid Computing macros work even if we only have a single SAS server, i.e. can I use them to programmatically execute program entries in parallel?
If you're looking for simplicity -- but with parallel processing -- then you'll need to probably limit your scenarios to those that don't share WORK data and don't have too many parallel branches.
Here is the use case that I wish EG (easily) supported. It's based around common ETL practices, of which I'm sure SAS R&D are aware. For those organizations that are too cheap to have a proper scheduler and want to use EG for this sort of processing (even if just for development), this could be a useful workaround.
By default NOT setup for parallel processing.
Process Flow: "Setup"
Set SAS options, SASAUTOS setting, allocate libraries, etc. Whatever global setup you need for your processing.
Process Flow: "Extract"
Since there are no dependencies within the extract processing, I want this entire process flow to run in parallel.
So I RMB the Process Flow properties (which IMO are currently pretty useless), and select the mythical "Run this process flow in parallel" option.
I do my delta extract (say any rows changed in the last week) from source to staging tables.
In my RDBMS environment, this would be explicit passthrough to the RDBMS, so SAS is just a client submitting code.
In a SAS environment, this would extract SAS data from SAS source datasets to SAS staging datasets. They would have to be permanent SAS datasets since the code is running in parallel.
I wish EG could "inject" one or more program entries and/or the process flow (i.e. "Setup") into the parallel jobs. A bit analogous to the Autoexec process flow option, but for parallel processing.
Alternatively, use the Link functionality to have a setup program that links to all the downstream parallel programs. So you'd need the ability to have multiple links downstream from the setup program.
A workaround is to save the "Setup" code into one or more external files, and %include those files in the parallel extract programs.
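That workaround might look like this at the top of each parallel extract program (paths are hypothetical):

```sas
/* Replay the global setup that a freshly spawned workspace session  */
/* would otherwise be missing: options, SASAUTOS, libnames, etc.     */
%include "/sasdata/project/setup_options.sas";
%include "/sasdata/project/setup_libnames.sas";

/* ...delta extract code for this table follows... */
```

The downside is keeping those external files in sync with the "Setup" process flow, which is why native support for injecting setup code into parallel jobs would be nicer.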
Process Flow: "Transform"
This will have process order dependencies, so I need this code to run synchronously.
For example, I first prepare all the dimension tables, then prepare the fact tables with the surrogate keys / foreign keys from the dim tables.
Process Flow: "Load"
There are no dependencies between tables, so I want to run these programs in parallel.
I do a similar setup as per the "Extract" process flow.