08-16-2011 08:48 AM
I have a datastep with the following structure:
- Merge two large files (with a single BY variable)
- Do some datastep calculations
- Write out another datafile
This is running on SAS 9.2 on a Win2008 x64 server. When it runs, one of the 32 cores on the server is at 100% CPU with the others unused, so I presume it could benefit from parallelisation.
Is there an easy way to do this in SAS? I've seen some mention of proc sql exploiting multiple threads internally, so could I use this? Or has anyone written macro code to do the splitting and joining required?
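For reference, the single-threaded job is structured roughly like this (dataset and variable names are placeholders):

```sas
/* Placeholder sketch of the current single-threaded step */
data out.result;
    merge in.file1 (in=a) in.file2 (in=b);
    by id;                      /* the single BY variable */
    if a and b;
    /* ...datastep calculations... */
run;
```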
08-16-2011 09:29 AM
I've never tried it, but you might find the concepts included in the following paper helpful in your attempt:
08-16-2011 06:46 PM
You can parallelise tasks by splitting the data logically, processing the pieces in separate SAS server sessions, then combining the results afterwards. Check the SAS documentation for asynchronous processing using SAS/CONNECT, or other posts on this topic. Also consider other performance-enhancing techniques like data compression, indexing, tuning I/O buffers etc.
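Something along these lines (a sketch only, assuming a SAS/CONNECT license and that the BY variable splits cleanly; libnames and the split point are illustrative, and each spawned session needs its own libname assignments, e.g. via an autoexec):

```sas
/* Sketch: two subsets run in parallel local sessions via
   SAS/CONNECT (MP CONNECT). Split logic is illustrative. */
options sascmd="!sascmd" autosignon;

rsubmit task1 wait=no;
    data out.part1;
        merge in.file1 (where=(id <= 500000))
              in.file2 (where=(id <= 500000));
        by id;
        /* datastep calculations */
    run;
endrsubmit;

rsubmit task2 wait=no;
    data out.part2;
        merge in.file1 (where=(id > 500000))
              in.file2 (where=(id > 500000));
        by id;
        /* datastep calculations */
    run;
endrsubmit;

waitfor _all_ task1 task2;
signoff _all_;

/* Recombine; interleaving with BY keeps the sort order */
data out.result;
    set out.part1 out.part2;
    by id;
run;
```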
08-17-2011 05:36 PM
The SPDE library engine works very well for this, as long as the I/O subsystem on the machine is sufficient to support it. The SPDE engine also provides a number of other benefits, like being able to split data sets over multiple I/O channels and disks. I use the engine for extremely large data - tables in the trillions of rows, for example. Here are some metrics from a job I ran just the other day:
I extract a table from an Oracle data warehouse. The data is column-dimensioned: 12.3T rows and 4 columns.
- Extract in 8 threads to a single dataset using dbslice... 2 hours
- Proc sort of the entire set by primary and secondary key... 5 hours
- Datastep to transpose the data using BY-group processing and arrays... 9 hours
The result is a row-dimensioned SAS table with 200M rows and 5500 columns.
Without the SPDE library engine these processes took so long that they were basically deemed useless - most significantly, the final transpose step took more than 3 full days to run.
The machine used to produce the above numbers is 64-bit Linux with 8 cores and 42 GB RAM; storage is a 24-disk DAS with two 12-disk RAID 5 groups that are then in a RAID 0 configuration. The SPDE engine is using 8 data store points and 24 max threads.
In short, I LOVE the SPDE library engine! It is an incredibly useful tool for big data.
Another method for data step multithreading is to split the data into logical groups, as others said, and use the MP CONNECT tools (available if you have a SAS/CONNECT license), then run multiple async steps. I personally prefer the comparative simplicity of the SPDE engine.
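To give a flavour, a minimal SPDE setup looks something like this (paths and sizes are illustrative; note that SPDE parallelises I/O and WHERE evaluation, not the datastep compute itself):

```sas
/* Sketch: SPDE library spread across several I/O paths.
   PARTSIZE is in megabytes on SAS 9.2; tune to your data. */
libname fast spde '/saswork/spdemeta'
    datapath=('/disk1/spde' '/disk2/spde' '/disk3/spde' '/disk4/spde')
    indexpath=('/disk5/spdeidx')
    partsize=2048
    compress=binary;

/* Reads and writes against fast.* tables are threaded transparently */
data fast.result;
    merge fast.file1 fast.file2;
    by id;
run;
```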
08-17-2011 09:21 PM
The SPDE library engine is included as part of your Base SAS license, which is quite nice. It is somewhat of an introductory product to SPDS.
08-17-2011 02:11 AM
Thanks for the suggestions. I hadn't seen the 2010 paper - it looks close to what I want. Shame this feature isn't built into the SAS datastep language.
The 2010 paper suggests putting the 'normal' datastep code into a separate file so it can be read by multiple SAS processes. It would be neater if this code were in a macro in the main program (like the SAS bootstrap macros) - that way I could also use global macro variables and macros within the data step. To do this, I guess I would need to pass the global macro environment of the main process (e.g. defined macros and global macro variables) to the spawned processes. Is this possible in SAS?
08-17-2011 05:45 PM
Yes. Check out the %SYSRPUT and %SYSLPUT macro statements. Be careful with the use of macros between main and spawned processes, as they have to be submitted in all processes to be available. I still think %include is a cleaner proposition, but that may depend on the complexity of your processing.
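A sketch of %SYSLPUT pushing a macro variable into a spawned session (the session name and variable are made up):

```sas
/* Sketch: seed a spawned SAS/CONNECT session with a macro variable */
options sascmd="!sascmd";
%let cutoff = 500000;

signon task1;
%syslput cutoff = &cutoff / remote=task1;   /* sets &cutoff remotely */

rsubmit task1;
    %put NOTE: remote cutoff is &cutoff;
endrsubmit;
signoff task1;
```

%SYSRPUT works in the opposite direction, copying a value from the spawned session back to the main one.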
08-18-2011 05:38 AM
as you say "I guess I would need to pass the global macro environment of the main process (eg defined macros and global macro variables) to the spawned processes"
It is just a matter of passing one parameter to each subtask, to identify what segment that subtask needs to do, plus making the macro pool addressable by all subtasks - or having duplicates of the macro pool privately available to each separate subtask.
The single parameter serves multiple purposes:
the subtask "task-id",
the subsetting value for a where clause,
and/or, a subscript on the name of the input and output data sets.
All other parameters like "global macro variables" will be common to all, so can be a part of that "macro pool"
Keeping the parameter list "singular" seems best.
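As a sketch of that single-parameter pattern (macro, library, and variable names are invented):

```sas
/* Sketch: one parameter serves as task-id, WHERE subset value,
   and output dataset suffix all at once */
%macro dopart(taskid);
    data out.result&taskid;
        merge in.file1 in.file2;
        by id;
        where mod(id, 4) = &taskid;   /* subsetting value */
        /* shared globals from the "macro pool", e.g. &cutoff,
           are common to all subtasks */
    run;
%mend dopart;

/* each subtask submits one call: %dopart(0) ... %dopart(3) */
```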