I have a simulation data set of over 200 million rows that I need to sort using PROC SORT and MP CONNECT.
Can someone help me with a code template?
So it's a huge table? If you have enough memory, I would choose a hash table, but with data this size there usually isn't enough.
How about splitting it into several smaller tables?
Ksharp
The SORT procedure has been multithreaded since v9: the THREADS option should be on by default, and the CPUCOUNT option should default to the number of CPUs in your system. Unless your system has no support for multithreading or your SAS release is older than v9, the SORT procedure is already pretty much enabled for parallelism.
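A minimal sketch of checking and forcing those options before the sort; the dataset name `work.big` and the BY variable `id` are placeholders for your own:

```sas
/* Make the multithreading settings explicit (these are usually the defaults) */
options threads cpucount=actual sortsize=max;

/* Plain multithreaded sort; replace work.big and id with your table and key */
proc sort data=work.big out=work.big_sorted;
   by id;
run;
```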
More on SORT procedure here:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473663.htm
Should you still need to use the MP CONNECT feature, here's what I use:
options autosignon=yes sascmd="!sascmd"; /* spawn remote sessions automatically */

rsubmit process=thread1 wait=no; /* start an asynchronous session */
/*** code here ***/
endrsubmit;

rsubmit process=thread2 wait=no;
/*** code here ***/
endrsubmit;

rsubmit process=thread3 wait=no;
/*** code here ***/
endrsubmit;

waitfor _all_ thread1 thread2 thread3; /* block until all sessions finish */
options autosignon=no;
And a lot more here:
http://support.sas.com/rnd/scalability/tricks/index.html
Cheers from Portugal.
Daniel Santos @ www.cgd.pt
As Daniel says: Proc Sort is multithreaded.
I don't see how MP CONNECT could help you here. What would help (and I've seen this recently) is storing your source data with the SPDE engine, as that allows multithreaded I/O. Even better would be to define the SPDE library over several disks, as that would give you better I/O throughput.
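A sketch of such an SPDE library definition; all the paths and the PARTSIZE value are placeholders, to be replaced with mount points on separate physical disks/controllers:

```sas
/* Hypothetical SPDE library spread over several disks:
   metadata on one path, data partitions striped across others */
libname bigspde spde '/disk1/meta'
   datapath=('/disk2/data' '/disk3/data' '/disk4/data')
   indexpath=('/disk5/idx')
   partsize=2g;  /* size of each data partition file */
```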
Using a hash:
With 200M rows, only if you have a LOT of addressable memory. A while ago I ran a test on my Win7 8GB laptop to see how many keys I could store in a hash, and I remember that SAS crashed at around 200M distinct values (numeric).
Not sure what you need to do, but it might also be worth thinking about indexing your dataset instead of sorting it. Again, using the SPDE engine would be beneficial. I've seen cases where people indexed huge datasets (using multiple columns) and were then astonished that it didn't help much. What happened was that the index grew to something like 20% of the table size, and with the overhead of first reading the index and then the data, overall performance didn't improve much (the bottleneck was I/O).
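For illustration, a sketch of indexing instead of sorting; the dataset `work.big` and key `id` are placeholder names:

```sas
/* Create a simple index on the key variable */
proc datasets library=work nolist;
   modify big;
      index create id;
quit;

/* BY-group processing can then rely on the index
   without a physical sort of the 200M rows */
data stats;
   set work.big;
   by id;
run;
```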
So 200M rows stored in a standard SAS table will be a challenge in any case. Make sure it's stored on the fastest disk available, and if you can influence the settings in your environment in any way, make sure that the WORK space and UTILLOC (where intermediate sort "slices" get stored) are not on the same disk (same controller). It's very, very likely all about I/O for you.
One last thing: using PROC SORT with the NOEQUALS option allows SAS to use a more efficient sort algorithm (I'm not sure whether this is already the default, but it doesn't hurt to set the option explicitly).
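A minimal sketch; dataset and variable names are placeholders:

```sas
/* NOEQUALS drops the guarantee that observations with equal key
   values keep their original relative order, which can save
   time and resources on a 200M-row sort */
proc sort data=work.big out=work.big_sorted noequals;
   by id;
run;
```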
What do you mean by "partitioned data"? Where is the data stored (SAS or a database; and if a database: which one, which version, and is the table partitioned, and how)?
I found this link which is quite interesting and possibly what you have in mind: "Piping Between Data Step and Proc Sort on SMP Machine", http://support.sas.com/rnd/scalability/tricks/connect.html#pipds
Message was edited by: Patrick