I have a simulation data set of over 200 million rows that I need to sort using PROC SORT and MP CONNECT.
Can someone help me with a code template?
So it's a huge table? If you have enough memory, I would choose a hash table, but with data this size there usually isn't enough.
How about splitting it into several smaller tables?
Ksharp
The SORT procedure has been multithreaded since v9: the THREADS option should be on by default, and the CPUCOUNT option should default to the number of CPUs in your system. Unless your system has no support for multithreading or your SAS release is older than v9, the SORT procedure is already pretty much enabled for parallelism.
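A minimal sketch of checking and forcing those options before the sort; the dataset name `work.big` and the BY variable `id` are placeholders for your own:

```sas
/* Make the multithreading settings explicit (these are usually the defaults) */
options threads cpucount=actual sortsize=max;

/* Plain multithreaded sort; replace work.big and id with your table and key */
proc sort data=work.big out=work.big_sorted;
   by id;
run;
```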
More on SORT procedure here:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473663.htm
Should you still need to use the MP CONNECT feature, here's what I use:
options autosignon=yes sascmd="!sascmd"; /* spawn remote sessions automatically */

rsubmit process=thread1 wait=no; /* start an asynchronous session */
/*** code here ***/
endrsubmit;

rsubmit process=thread2 wait=no;
/*** code here ***/
endrsubmit;

rsubmit process=thread3 wait=no;
/*** code here ***/
endrsubmit;

waitfor _all_ thread1 thread2 thread3; /* block until all sessions finish */
options autosignon=no;
And a lot more here:
http://support.sas.com/rnd/scalability/tricks/index.html
Cheers from Portugal.
Daniel Santos @ www.cgd.pt
As Daniel says: Proc Sort is multithreaded.
I don't see how MP CONNECT could help you here. What would help (and I've seen this recently) is storing your source data with the SPDE engine, as that allows multithreaded I/O. Even better would be to define the SPDE library over several disks, as that would give you better I/O throughput.
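A sketch of such an SPDE library definition; all the paths and the PARTSIZE value are placeholders, to be replaced with mount points on separate physical disks/controllers:

```sas
/* Hypothetical SPDE library spread over several disks:
   metadata on one path, data partitions striped across others */
libname bigspde spde '/disk1/meta'
   datapath=('/disk2/data' '/disk3/data' '/disk4/data')
   indexpath=('/disk5/idx')
   partsize=2g;  /* size of each data partition file */
```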
Using a hash:
With 200M rows, only if you have a LOT of addressable memory. A while ago I ran a test on my Win7 8GB laptop to see how many keys I could store in a hash, and I remember that SAS crashed at around 200M distinct values (numeric).
Not sure what you need to do, but it might also be worth thinking about indexing your dataset instead of sorting it. Again, using the SPDE engine would be beneficial. I've seen cases where people indexed huge datasets (using multiple columns) and were then astonished that it didn't help much. What happened was that the index grew to something like 20% of the table size, and with the overhead of first reading the index and then the data, overall performance didn't improve much (the bottleneck was I/O).
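For illustration, a sketch of indexing instead of sorting; the dataset `work.big` and key `id` are placeholder names:

```sas
/* Create a simple index on the key variable */
proc datasets library=work nolist;
   modify big;
      index create id;
quit;

/* BY-group processing can then rely on the index
   without a physical sort of the 200M rows */
data stats;
   set work.big;
   by id;
run;
```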
So 200M rows stored in a standard SAS table will be a challenge in any case. Make sure it's stored on the fastest disk available, and if you can influence the settings in your environment in any way, make sure that the WORK space and UTILLOC (where intermediate sort "slices" get stored) are not on the same disk (same controller). It's very, very likely all about I/O for you.
One last thing: using PROC SORT with the NOEQUALS option allows SAS to use a more efficient sort algorithm (I'm not sure whether this is already the default, but it doesn't hurt to set the option explicitly).
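A minimal sketch; dataset and variable names are placeholders:

```sas
/* NOEQUALS drops the guarantee that observations with equal key
   values keep their original relative order, which can save
   time and resources on a 200M-row sort */
proc sort data=work.big out=work.big_sorted noequals;
   by id;
run;
```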
What do you mean by "partitioned data"? Where is the data stored (SAS or a database; and if a database: which one, which version, and is the table partitioned, and how)?
I found this link which is quite interesting and possibly what you have in mind: "Piping Between Data Step and Proc Sort on SMP Machine", http://support.sas.com/rnd/scalability/tricks/connect.html#pipds
Message was edited by: Patrick