DATA Step BY-Group Processing: Compute Server vs CAS

Luhan — Thu, 30 Jun 2022 12:14:30 GMT

Hi,

I am seeing very long run times on the CAS server when using BY-group processing in a DATA step (even with values uniformly distributed across few or many threads). Using FedSQL instead, or even running PROC SORT + DATA step on the Compute Server, always seems to be faster. The issue can be replicated in the Virtual Lab of a SAS course as follows:

Step 1: Create the same data set containing names with 10 million rows in both environments (work.test + casuser.test).

data
	work.test
	casuser.test
;
	call streaminit(100);
	do i = 1 to 10000000;
		random = rand("Integer", 1, 5);
		if 1 <= random <= 3 then name = "R. Pearlman";
		else if random  = 4 then name = "J. McNulty";
		else if random  = 5 then name = "C. Daniels";
		output; 
	end;
	keep name;
run;

Step 2: Count the number of occurrences for each name on the Compute Server (this requires a SORT Step):

proc sort
	data = test
	out  = test2
;
	by name;
run;

data test2;
	set test2;
	by name;
	if first.name then count = 0;
	count +1;
	if last.name;
run;

In total, this code needed 1.53 seconds to run. CPU time is 4.17 seconds, which tells me that some parallelization took place here.

Step 3: Count the number of occurrences for each name on the CAS Server (no sorting required):

data casuser.test2;
	set casuser.test;
	by name;
	if first.name then count = 0;
	count +1;
	if last.name;
run;

This code took 11.28 seconds in total but used only 0.02 seconds of CPU time.

Questions:

Why is it that the BY-Group processing is so much faster on the Compute Server?
My guess would be that distributing rows across threads according to BY variables is very slow on the CAS Server.
Why is it that the CPU time is so low on the CAS server compared to the real time and that increasing the amount of data almost only affects real time?
I even expected the CPU time to be larger than real time because of parallelization on the CAS server.

Thanks,

Luhan

Re: DATA Step BY-Group Processing: Compute Server vs CAS

mkeintz — Thu, 30 Jun 2022 16:20:49 GMT

Is your test a little too artificial to be informative? Your test dataset is very narrow, making the preparatory PROC SORT very cheap. I wonder whether the apparent real-time superiority of the SORT followed by the non-CAS data step would still show up with a fat file.

I've never used CAS, so this is said in complete ignorance. For speed test purposes, why not run the CAS DATA step once with and once without a BY statement, as in:

data _null_;
  set casuser.test;
  count+1;
run;

data _null_;
  set casuser.test;
  by name;
  count+1;
run;

Yes, I know that it doesn't produce the results you want, but it does tell you the impact of a BY statement.
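To see where the time goes in a comparison like this, one option is to switch on CAS session metrics before running the steps. This is only a sketch: it assumes the CAS session is named casauto, and the times shown in the regular SAS log may not include the work done on the CAS side, which would be one possible reason for a very low CPU time there.

```sas
/* Assumption: the CAS session is named casauto */
cas casauto sessopts=(metrics=true);

data _null_;
	set casuser.test;
	by name;
	count+1;
run;

/* Switch the extra log output off again */
cas casauto sessopts=(metrics=false);
```

With metrics on, the log should carry additional notes for the work executed in CAS, so the two variants can be compared on the server side as well.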

Question: does a BY statement cause CAS to pre-sort the unordered data, or does CAS just create threads based on by-values? If it's the latter, then CAS will never satisfy any of the "if last.name;" filters until the end of the data set. That's probably a lot of overhead, unneeded by the single thread approach applied against a sorted data set. Yes, there are only three values for name in your sample, but maybe the fixed cost of maintaining a dynamic set of by groups is big, no matter the cardinality.

Re: DATA Step BY-Group Processing: Compute Server vs CAS

Luhan — Thu, 30 Jun 2022 19:33:35 GMT

Thanks for your reply and suggestion to isolate the BY statement effect. It seems to support my guess that DATA step processing on CAS slows down significantly when the data is distributed according to BY variables.

Step 1: BY or not to BY with original sample data

Without a BY statement, CAS evenly distributes the data across the 32 available threads of my session in no time:

data _null_;
	set casuser.test end=eof;

	if eof;
	put "thread:" _threadid_ " obs:" _n_;
run;

When using a BY statement, the distribution is done across 3 threads (because the BY variable has 3 levels) and this slows things down significantly:

data _null_;
	set casuser.test end=eof;

	by name;

	if eof;
	put "thread:" _threadid_ " obs:" _n_;
run;

Step 2: BY or not to BY with uniformly distributed sample data

I thought using a BY variable that is uniformly distributed with 32 levels would take full advantage of the 32 threads:

data casuser.test;
	call streaminit(100);
	do i = 1 to 10000000;
		random = rand("Integer", 1, 32);
		name = catx(" ", "Name number", random);
		output;
	end;
run;

Surprisingly, the performance got even worse - even though (or because) 23 instead of 3 threads were used:

data _null_;
	set casuser.test end=eof;

	by name;

	if eof;
	put "thread:" _threadid_ " obs:" _n_;
run;
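One way to check how the BY groups were spread over the threads is to keep one row per group together with the thread that processed it. The following is only a sketch that reuses the counting pattern from earlier in this thread; the output table casuser.groups_per_thread and the variables rows and thread are names made up for illustration.

```sas
data casuser.groups_per_thread;
	set casuser.test;
	by name;

	/* Reset the row counter at the start of each BY group */
	if first.name then rows = 0;
	rows + 1;

	/* Keep only the last row of each BY group */
	if last.name;

	/* Record which thread processed this BY group */
	thread = _threadid_;
	keep name thread rows;
run;
```

If several names show the same thread value, those groups ran one after another on that thread, which would be consistent with fewer than 32 threads being busy.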

Step 3: FedSQL

Just for comparison: this is the run time of a FedSQL step on the same data that groups by name, applies a summary function, and generates a report:

proc fedsql sessref=casauto;
	select name, count(*)
	from casuser.test
	group by name;
quit;

Conclusion (so far):

At least on my setups (Viya SMP), the performance of DATA steps decreases significantly when BY variables influence the distribution of the data. Using FedSQL or CAS actions instead seems to be the way to go for BY-group processing on CAS.
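For the CAS action route, a minimal sketch of the same count per name could use the simple.freq action through PROC CAS. It assumes the session is named casauto as in the FedSQL step and that the table is still casuser.test. I have not timed this variant, so it is shown only to illustrate the approach.

```sas
proc cas;
	/* Assumption: the CAS session is named casauto */
	session casauto;

	/* Frequency of each distinct value of name */
	simple.freq /
		table={caslib="casuser", name="test"},
		inputs={"name"};
quit;
```

The result should be a frequency table with one row per distinct name, without a BY statement being involved.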

Re: DATA Step BY-Group Processing: Compute Server vs CAS

SASKiwi — Thu, 30 Jun 2022 20:02:45 GMT

What version of Viya are you using? Have you tracked this issue with SAS Tech Support and, if so, what was the response? If not, I suggest you do so and then add their response to this post.