BruceBrad
Lapis Lazuli | Level 10

I have a datastep with the following structure:

- Merge two large files (with a single BY variable)

- Do some datastep calculations

- Write out another data file (sketched below)
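In outline (data set and variable names here are just placeholders):

data want;
   merge big1 big2;   /* both inputs already sorted by keyvar */
   by keyvar;
   /* ...datastep calculations... */
run;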

This is running on SAS 9.2 on a Win2008 x64 server. When it runs, one of the 32 cores on the server sits at 100% CPU with the others unused - so I presume it could benefit from parallelisation.

Is there an easy way to do this in SAS?  I've seen some mention of proc sql exploiting multiple threads internally, so could I use this?  Or has anyone written macro code to do the splitting and joining required?

10 REPLIES
art297
Opal | Level 21

Bruce,

I've never tried it, but you might find the concepts included in the following paper helpful in your attempt:

http://support.sas.com/resources/papers/proceedings10/109-2010.pdf

SASKiwi
PROC Star

You can parallelise tasks by splitting the data logically, processing the splits in separate SAS server sessions, and then combining the results afterwards. Check the SAS documentation for asynchronous processing using SAS/CONNECT, or other posts on this topic. Also consider other performance-enhancing techniques like data compression, indexing, tuning I/O buffers etc.
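A rough sketch of that pattern with MP CONNECT (the library path, session names and SPLIT variable are invented - adapt to your own data):

libname mylib '/shared/path';            /* a library both parent and child sessions can see */

options sascmd='!sascmd';                /* spawn server sessions with the same command as this one */
signon task1 inheritlib=(mylib);
signon task2 inheritlib=(mylib);

rsubmit task1 wait=no;                   /* runs asynchronously */
   data mylib.out1;
      set mylib.big(where=(split=1));    /* first logical split */
      /* ...calculations... */
   run;
endrsubmit;

rsubmit task2 wait=no;
   data mylib.out2;
      set mylib.big(where=(split=2));    /* second logical split */
      /* ...calculations... */
   run;
endrsubmit;

waitfor _all_ task1 task2;               /* block until both tasks finish */
signoff _all_;

data mylib.combined;                     /* put the results back together */
   set mylib.out1 mylib.out2;
run;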

FriedEgg
SAS Employee

The SPDE library engine works very well for this, as long as the I/O subsystem on the machine is sufficient to support it. The SPDE engine also provides a number of other benefits, like being able to split data sets over multiple I/O channels and disks. I utilize the engine for extremely large data - tables in the trillions of rows, for example. Here are some metrics from a job I ran just the other day:

I extract a table from an Oracle data warehouse. The data is column-dimensioned: 12.3T rows and 4 columns.

Extract in 8 threads to single dataset using dbslice... 2 hours

Proc sort entire set by primary and secondary key... 5 hours

Datastep to transpose data using by group processing and arrays... 9 hours

The resulting data set is a row-dimensioned SAS table with 200M rows and 5500 columns.
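The transpose itself is the classic DOW-loop-plus-array pattern - roughly this (a generic sketch with invented names, not my production code):

data wide;
   do until (last.id);
      set tall;                 /* sorted by id, then colnum */
      by id;
      array v {5500};           /* creates v1-v5500, one per output column */
      v{colnum} = value;        /* drop each incoming row's value into its column */
   end;
   drop colnum value;           /* leaves one wide row per id */
run;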

Without the SPDE library engine these processes took so long that they were basically deemed useless... Most significantly, the final transpose step took more than 3 full days to run.

The machine used to create the above numbers is 64-bit Linux with 8 cores and 42 GB RAM; storage is a 24-disk DAS with two 12-disk RAID 5 groups that are then combined in a RAID 0 configuration. The SPDE engine is using 8 data store points and 24 max threads.

In short, I LOVE the SPDE library engine! It is an incredibly useful tool for big data.
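To get started, an SPDE libref looks roughly like this (paths are made up; tune DATAPATH= and PARTSIZE= to your own disks and cores):

libname bigdata spde '/spde/meta'              /* metadata files live here */
   datapath=('/disk1/spde' '/disk2/spde')      /* data partitions spread over I/O channels */
   partsize=512;                               /* partition size in MB */

data bigdata.sorted;                           /* reads and writes are now multi-threaded */
   set bigdata.raw;
run;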

Another method for data step multithreading is to split the data into logical groups, as others have said, and use the MP-CONNECT tools (available if you have a SAS/CONNECT license) to run multiple asynchronous steps. I personally prefer the comparative simplicity of the SPDE engine.

BruceBrad
Lapis Lazuli | Level 10

Is SPDE part of Base SAS or is it licensed as a separate product? How can I tell if I have a licence for it?

art297
Opal | Level 21

I would think you could tell by trying to run something like:

proc spdo;run;

If you don't have it I would guess the log would inform you of that fact.
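Another quick test that should work, since SPDE ships with Base SAS: assign an SPDE libref and see whether the LIBNAME statement errors.

libname spdetest spde "%sysfunc(pathname(work))"; /* point SPDE at the WORK location */
libname spdetest clear;                           /* tidy up if the assignment succeeded */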

FriedEgg
SAS Employee

The SPDE library engine is an included feature of your Base SAS license, which is quite nice. It is somewhat of an introductory product leading up to the SPD Server (SPDS) product.

SASJedi
SAS Super FREQ

This paper has some sample macro code that could prove helpful:

http://www2.sas.com/proceedings/forum2007/036-2007.pdf

Check out my Jedi SAS Tricks for SAS Users
BruceBrad
Lapis Lazuli | Level 10

Thanks for the suggestions. I hadn't seen the 2010 paper - it looks close to what I want. Shame this feature isn't built into the SAS data step.

The 2010 paper suggests putting the 'normal' datastep code into a separate file so it can be read by multiple SAS processes. It would be neater if this code were in a macro in the main program (like the SAS bootstrap macros) - that way I could also use global macro variables and macros within the data step. To do this, I guess I would need to pass the global macro environment of the main process (eg defined macros and global macro variables) to the spawned processes. Is this possible in SAS?

SASKiwi
PROC Star

Yes. Check out the %SYSRPUT and %SYSLPUT macro statements. Be careful with the use of macros between main and spawned processes, as they have to be submitted in all processes to be available. I still think %INCLUDE is a cleaner proposition, but that may depend on the complexity of your processing.
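A sketch of the idea (session name, path and macro variable names are invented):

%let cutoff = 2010;                          /* defined in the main session */

signon task1;
%syslput cutoff=&cutoff /remote=task1;       /* copy the macro variable into the spawned session */

rsubmit task1 wait=no;
   libname mylib '/shared/path';
   data mylib.out1;
      set mylib.big(where=(year >= &cutoff)); /* &cutoff now resolves remotely */
   run;
   %sysrput done1=OK;                        /* send a status flag back to the main session */
endrsubmit;

waitfor task1;
signoff task1;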

Peter_C
Rhodochrosite | Level 12

Bruce

 

as you say  "I guess I would need to pass the global macro environment of the main process (eg defined macros and global macro variables) to the spawned processes"

  

It is just a matter of passing one parameter to each subtask, to identify what segment that subtask needs to do, plus making the macro pool addressable by all subtasks - or having duplicates of the macro pool privately available to each separate subtask.

The single parameter serves multiple purposes:

     the subtask "task-id",

     the subsetting value for a where clause,

     and/or, a subscript on the name of the input and output data sets.

All other parameters like "global macro variables" will be common to all, so they can be part of that "macro pool".

Keeping the parameter list "singular" seems best.
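For example (macro name, data sets and the GROUP variable are invented - a sketch only):

%macro dotask(taskid);                        /* the single parameter */
   data mylib.out&taskid;                     /* ...subscripts the output name... */
      merge mylib.big1(where=(group=&taskid))
            mylib.big2(where=(group=&taskid)); /* ...and subsets each input */
      by keyvar;
      /* ...common calculations; shared macro variables resolve as usual... */
   run;
%mend dotask;

/* each spawned subtask runs just one call, e.g. %dotask(1) */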

  

peterC

