Re: Threads in DS2, no time gain

JacobSimonsen · Posted 10-16-2014 07:01 AM

Dear Experts,

I wonder why I can't get a time gain by using threaded processing in DS2. Below I create a dataset with a group variable, then I count the observations in each group using threaded BY processing in DS2. It turns out to take far longer than doing the same with a normal data step. I had expected threaded processing to give far better performance, so I wonder why this does not happen.

By the way, this example is just to illustrate the problem. If "counting observations" was the real problem there are better methods to do that.

*the test dataset:;
data test;
do group=1 to 10;
   do i=1 to 1000000;
      output;
   end;
end;
run;

*Count observations with DS2:;

proc ds2 bypartition=no stimer;
thread read/overwrite=yes;
declare double count;
method init();
    count=0;
end;
method run();
   set test;
   by group i;
   if first.group then count=0;
   count+1;
   if last.group then output;
end;
endthread;

data abc/overwrite=yes;
keep group count;
declare thread read instance;
method run();
set from instance threads=4;
output;
end;
run;
quit;

NOTE: DS2 query used (Total process time):
real time 11.23 seconds
cpu time 26.59 seconds

*In comparison, an ordinary datastep:;
data abc;
set test;
by group i;
if first.group then count=0;
count+1;
run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.ABC has 10000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 6.89 seconds
cpu time 6.59 seconds

It can vary a bit from one run to another, but basically the same result came out each time. Also, changing bypartition to "yes" does not make any big difference. And yes, I do have multiple processors on my server.

Kurt_Bremser · Posted 10-16-2014 08:56 AM

Parallel processing of I/O-intensive tasks only makes sense if the I/O can be split onto physically separate devices.

As long as the data set in question is on one device, the threads will cause colliding requests on that device and ultimately slow the process down as compared to one single, often sequential scan through the data set.

That's why the SPDE engine works best with groups of disks aligned along the number of procs.
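
To illustrate the SPDE point: a library can be spread across several disks with the DATAPATH= option of the SPDE LIBNAME statement. A minimal sketch, assuming three physically separate devices; the libref and all paths are hypothetical:

```sas
/* Libref and paths are hypothetical. Each DATAPATH= directory is
   assumed to sit on a physically separate disk. */
libname fast spde '/spde/meta'
   datapath=('/disk1/spde' '/disk2/spde' '/disk3/spde');

/* Copy the test table into the SPDE library so its data
   partitions are distributed over the listed paths */
data fast.test;
   set work.test;
run;
```

Whether this helps depends on the paths really being separate devices; several directories on one disk would still compete for the same I/O.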


jakarman · Posted 10-16-2014 09:37 AM

It would become more interesting if you loaded the dataset into memory using the SASFILE approach.

That would eliminate the I/O constraint, which is the slowest part of all processing.

You need a lot of memory, but that should not be an issue these days.

The next factor is the overhead of starting and maintaining threads. When that overhead is high compared to the processing itself, you have another reason why overall speed will not improve.

---->-- ja karman --<-----

JacobSimonsen · Posted 10-16-2014 09:42 AM

I have tried that also, but it doesn't help. With "SASFILE test load" before proc ds2 I get almost the same result:

NOTE: DS2 query used (Total process time):
real time 7.79 seconds
cpu time 24.11 seconds
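
For anyone repeating the test, the SASFILE variant looks roughly like this. It is only a sketch, assuming the WORK.TEST table and the PROC DS2 step from the first post:

```sas
/* Load WORK.TEST into memory and hold it there */
sasfile work.test load;

/* ... run the PROC DS2 step from the first post here ... */

/* Release the memory once the timing run is done */
sasfile work.test close;
```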

jakarman · Posted 10-16-2014 09:54 AM

You are probably hitting the overhead of starting and maintaining all the threads.

Adding a more complicated function instead of counting should prove that.

It is another dimension of what causes load.

Your result now shows that the total elapsed time is almost equal, but threading adds a lot of overhead: the step finished in about 8 seconds while about 24 seconds of CPU time was used.

---->-- ja karman --<-----

FriedEgg · Posted 10-17-2014 02:01 PM

Your routine is not computationally complex enough to benefit from threading; you are really only adding overhead, since the I/O is still in a single thread. Instead of a simple count, you may try this example:

options cpucount=actual;
proc options option=cpucount;run;
libname base '/u/jaseco/tmp/base';
data base.jmaster;
do j = 1 to 10e6;
output;
end;
run;
proc ds2;
thread r /overwrite=yes;
dcl double count k x;
method run();
set base.jmaster;
count+1;
do k=1 to 80;/* Add some gratuitous computation! */
x = k/count + k/count + k/count;
end;
end;
method term();
OUTPUT;
end;
endthread;
run;
quit;
proc ds2;
data j1(overwrite=yes);
dcl thread r r_instance;
dcl double count total;
method run();
set from r_instance threads=1;
total+count;
end;
enddata;
run;
quit;
proc ds2;
data j2(overwrite=yes);
dcl thread r r_instance;
dcl double count total;
method run();
set from r_instance threads=2;
total+count;
end;
enddata;
run;
quit;
proc ds2;
data j4(overwrite=yes);
dcl thread r r_instance;
dcl double count total;
method run();
set from r_instance threads=4;
total+count;
end;
enddata;
run;
quit;
proc ds2;
data j8(overwrite=yes);
dcl thread r r_instance;
dcl double count total;
method run();
set from r_instance threads=8;
total+count;
end;
enddata;
run;
quit;
proc ds2;
data j16(overwrite=yes);
dcl thread r r_instance;
dcl double count total;
method run();
set from r_instance threads=16;
total+count;
end;
enddata;
run;
quit;
/****************************/
/* And read it in DATA step */
/****************************/
data jold;
set base.jmaster end=finish;
count+1;
do k=1 to 80;/* Add some gratuitous computation! */
x = k/count + k/count + k/count;
end;
if finish then output;
run;
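
The five PROC DS2 steps above differ only in the output table name and the THREADS= value, so they could also be generated by a small macro. This is a sketch rather than part of the original example: the macro name `run_threads` and its parameter `n` are hypothetical, and it assumes the thread `r` has already been compiled by the first PROC DS2 step.

```sas
/* Hypothetical helper: runs the same DS2 data program
   with &n threads and writes the result to table j&n */
%macro run_threads(n);
proc ds2;
data j&n(overwrite=yes);
   dcl thread r r_instance;
   dcl double count total;
   method run();
      set from r_instance threads=&n;
      total+count;
   end;
enddata;
run;
quit;
%mend run_threads;

/* Same thread counts as the steps written out above */
%run_threads(1)
%run_threads(2)
%run_threads(4)
%run_threads(8)
%run_threads(16)
```

Each call produces its own timing note in the log, so the runs can be compared in the same way as before.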

JacobSimonsen · Posted 10-20-2014 03:54 AM

You are right - when the computational task is relatively larger than the I/O task, the gain from threaded processing can be huge even though the I/O is not threaded.

I tried the code you suggested and observed that the real (elapsed) time decreases a lot as the number of threads increases.

When 8 threads are used:

NOTE: PROCEDURE DS2 used (Total process time):
real time 2.40 seconds
cpu time 17.50 seconds

When the ordinary datastep is used:

NOTE: There were 10000000 observations read from the data set BASE.JMASTER.
NOTE: The data set WORK.JOLD has 1 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 42.09 seconds
cpu time 42.13 seconds
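
Expressed as ratios, the two logs above can be compared with a quick calculation. This is just arithmetic on the reported timings; the variable names are made up for illustration:

```sas
data _null_;
   real_ds2  = 2.40;   /* real time, DS2 with 8 threads  */
   cpu_ds2   = 17.50;  /* cpu time,  DS2 with 8 threads  */
   real_data = 42.09;  /* real time, ordinary data step  */

   speedup  = real_data / real_ds2;  /* elapsed-time gain            */
   avg_busy = cpu_ds2 / real_ds2;    /* average number of busy CPUs  */

   put speedup= 6.2 avg_busy= 6.2;
run;
```

That is a real-time speedup of roughly 17.5, with about 7.3 CPUs busy on average during the 8-thread run. The total CPU time is also lower in DS2 (17.50 against 42.13 seconds), so part of the gain appears to come from something other than threading.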
