Hello Chris.
Very good analysis the one you provided.
Although very off the original topic, this is getting very interesting.
I find it very nice and valuable to have some more academic discussion here.
I've ran your tests (3 runs) on my machine (AIX box with 12 lcpu and 20GB of RAM) and added 4 more variants to the solution (code posted at the end):
+ Summary Hash technique.
+ SORT(one var only)+DATASTEP with Piping Parallelism
+ MP(4 parts) SORT(one var only)+DATASTEP with Piping Parallelism
+ MP(4 parts) SORT(one var only)+DATASTEP with no Piping Parallelism
Results for the same data of your trials:
SORT(1VAR)+STEP time=32s, 32s, 32s
SQL(2VAR)+STEP time=58s, 57s, 60s
PIPED SORT(2VAR)+STEP time=31s, 30s, 31s
SUMMARY time=5s, 5s, 5s
PIPED SORT(1VAR)+STEP time=26s, 27s, 27s
PIPED/MP 4xSORT(1VAR)+STEP time=24s, 25s, 24s
HASH SUM time=29s, 30s, 32s
MP 4xSORT(1VAR)+STEP time=23s, 25s, 24s
Conclusion.
SQL worst.
Hash not brilliant but, keeps up with the average.
Piped Sort by 1 var does a better job than by 2.
MP(4 parts) good, but pretty much the same with or without piping.
And the gold medal goes for Summary by far.
Now, beside the amount of data, two more factors should be considered.
Type of data - the generated sample produces only 7 distinct values for B.
System load - some techniques (MP/piping and MP+piping) could be greatly influenced by the system load at execution.
So I slightly modified your sample to produce about 1000 distinct values for B.
Then I ran the same tests (3 runs), and here are the results:
SORT(1VAR)+STEP time=59s, 40s, 41s
SQL(2VAR)+STEP time=96s, 68s, 89s
PIPED SORT(2VAR)+STEP time=215s, 39s, 39s
SUMMARY time=63s, 59s, 70s
PIPED SORT(1VAR)+STEP time=45s, 44s, 49s
PIPED/MP 4xSORT(1VAR)+STEP time=37s, 37s, 68s
HASH SUM time=49s, 38s, 36s
MP 4xSORT(1VAR)+STEP time=32s, 37s, 177s
Conclusion.
SQL still worst.
Hash not brilliant but, again keeps up with the average.
MP/Piping processing produces the best marks, but are highly dependent on the system load, and so less stable in performance.
Summary has a lot of more to handle, and goes down on the ranking, very near to SQL performance.
So about summing data with SAS, I would conclude by this results, that there is no particular recommended technique. Still, avoid SQL unless for small amount of data.
Hash, not being the best, will assure you a good performance, but careful with the memory limitations (roughly I would say a limit of 2,000,000 of distinct elements would be reasonable).
Then we have PROC SUMMARY/MEANS which on v9 with its threading support as got a lot better, but, be aware that performance will dramatically reduce with the increase of data, and particularly with the number of distinct elements your grouping.
Multi-Processing with or without Piping Parallism, pushes away your processing power, but are alway very dependent on your system load, and results may vary from best to worst.
Chris, thank you for pushing this subject a little further, I find it great and mentally challenging to have from time to time this kind of discussion.
Cheers from Portugal.
Daniel Santos @
www.cgd.pt.