JacobSimonsen
Barite | Level 11

Dear Experts,

I wonder why I don't get a speed gain from threaded processing in DS2. Below I create a dataset with a group variable, then count the observations in each group using threaded BY-group processing in DS2. It turns out to take far longer than doing the same with a normal DATA step. I had expected threaded processing to give far better performance, so I wonder why this does not happen.

By the way, this example is just to illustrate the problem. If "counting observations" was the real problem there are better methods to do that.
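For reference, one of the simpler ways to get the per-group counts is a PROC SQL aggregate. This is only a minimal sketch against the WORK.TEST table created below; the output table name group_counts is an assumption chosen for illustration.

```sas
/* Count the rows in each group with a single aggregate query. */
/* group_counts is a hypothetical output table name.           */
proc sql;
  create table group_counts as
    select group, count(*) as count
    from test
    group by group;
quit;
```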

*the test dataset:;

data test;
  do group=1 to 10;
    do i=1 to 1000000;
      output;
    end;
  end;
run;

*Count observations with DS2:;

proc ds2 bypartition=no stimer;
  thread read/overwrite=yes;
  declare double count;
  method init();
    count=0;
  end;
  method run();
   set test;
   by group i;
   if first.group then count=0;
   count+1;
   if last.group then output;
  end;
  endthread;

  data abc/overwrite=yes;
    keep group count;
    declare thread read instance;
    method run();
      set from instance threads=4;
      output;
    end;
  run;
quit;

NOTE: DS2 query used (Total process time):
      real time           11.23 seconds
      cpu time            26.59 seconds

*In comparison, an ordinary datastep:;

data abc;
  set test;
  by group i;
  if first.group then count=0;
  count+1;
run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.ABC has 10000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           6.89 seconds
      cpu time            6.59 seconds

It can vary a bit from one run to another, but basically the same result came out each time. Also, changing bypartition to "yes" does not make any big difference. And yes, I do have multiple processors on my server.

7 REPLIES
Kurt_Bremser
Super User

Parallel processing of I/O-intensive tasks only makes sense if the I/O can be split across physically separate devices.

As long as the data set in question is on one device, the threads will cause colliding requests on that device and ultimately slow the process down as compared to one single, often sequential scan through the data set.

That's why the SPDE engine works best with groups of disks aligned along the number of procs.
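As a rough illustration of that point, an SPDE library can be told to spread its data partitions over several paths with the DATAPATH= option. This is a sketch only: every path below is a hypothetical placeholder, and it only helps if the paths really sit on physically separate devices.

```sas
/* All paths below are hypothetical placeholders. */
libname spdelib spde '/disk1/spde/meta'
  datapath=('/disk2/spde/data' '/disk3/spde/data' '/disk4/spde/data')
  indexpath=('/disk5/spde/index');

/* Copy the test table into the SPDE library so that its */
/* partitions are distributed over the data paths.       */
data spdelib.test;
  set work.test;
run;
```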

jakarman
Barite | Level 11

It would become more interesting if you loaded the dataset into memory using the SASFILE approach.

That would eliminate the I/O constraints, which are usually the slowest part of any processing.

You need to have a lot of memory but that should not be an issue these days.

The next factor is the overhead of starting and maintaining threads. When that overhead is high compared to the processing itself, you have another reason why overall speed will not improve.
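The SASFILE approach mentioned above could look roughly like this, assuming the table is WORK.TEST from the original question and that there is enough memory to hold it:

```sas
/* Read the whole table into memory and keep it open. */
sasfile work.test load;

/* ... run the PROC DS2 step (or any other step that reads WORK.TEST) here ... */

/* Release the memory buffers when finished. */
sasfile work.test close;
```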

---->-- ja karman --<-----
JacobSimonsen
Barite | Level 11

I have tried that also, but it doesn't help. With "SASFILE test load" before proc ds2 I get almost the same result:

NOTE: DS2 query used (Total process time):
      real time           7.79 seconds
      cpu time            24.11 seconds

jakarman
Barite | Level 11

You are probably hitting the overhead of starting and maintaining all the processes.

Adding a more complicated function instead of counting should prove that.

That adds another dimension to the load.

Your result now shows that the total response time is almost equal, but threading adds a lot of overhead: the step finished in 8 seconds while 25 seconds of CPU time were used.

---->-- ja karman --<-----
FriedEgg
SAS Employee

Your routine is not computationally complex enough to benefit from threading; you are really only adding overhead, since the I/O is still in a single thread. Instead of a simple count, you may try this example from

options cpucount=actual;
proc options option=cpucount;run;
libname base '/u/jaseco/tmp/base';
data base.jmaster;
  do j = 1 to 10e6;
    output;
  end;
run;
proc ds2;
  thread r /overwrite=yes;
    dcl double count k x;
    method run();
      set base.jmaster;
      count+1;
      do k=1 to 80;/* Add some gratuitous computation! */
        x = k/count + k/count + k/count;
      end;
    end;
    method term();
      OUTPUT;
    end;
  endthread;
  run;
quit;
proc ds2;
  data j1(overwrite=yes);
    dcl thread r r_instance;
    dcl double count total;
    method run();
      set from r_instance threads=1;
      total+count;
    end;
  enddata;
  run;
quit;
proc ds2;
  data j2(overwrite=yes);
    dcl thread r r_instance;
    dcl double count total;
    method run();
      set from r_instance threads=2;
      total+count;
    end;
  enddata;
  run;
quit;
proc ds2;
  data j4(overwrite=yes);
    dcl thread r r_instance;
    dcl double count total;
    method run();
      set from r_instance threads=4;
      total+count;
    end;
  enddata;
  run;
quit;
proc ds2;
  data j8(overwrite=yes);
    dcl thread r r_instance;
    dcl double count total;
    method run();
      set from r_instance threads=8;
      total+count;
    end;
  enddata;
  run;
quit;
proc ds2;
  data j16(overwrite=yes);
    dcl thread r r_instance;
    dcl double count total;
    method run();
      set from r_instance threads=16;
      total+count;
    end;
  enddata;
  run;
quit;
/****************************/
/* And read it in DATA step */
/****************************/
data jold;
  set base.jmaster end=finish;
  count+1;
  do k=1 to 80;/* Add some gratuitous computation! */
    x = k/count + k/count + k/count;
  end;
  if finish then output;
run;
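The five data programs above differ only in the thread count, so they could also be generated by a small macro loop. This is a sketch, not part of the original example: the macro name run_threads and its counts= parameter are assumptions for illustration, and it assumes the thread program r has already been compiled by the first PROC DS2 step.

```sas
/* Hypothetical helper: run the same data program once per thread count. */
%macro run_threads(counts=1 2 4 8 16);
  %local i n;
  %do i = 1 %to %sysfunc(countw(&counts));
    %let n = %scan(&counts, &i);
    proc ds2;
      data j&n(overwrite=yes);
        dcl thread r r_instance;
        dcl double count total;
        method run();
          set from r_instance threads=&n;
          total+count;
        end;
      enddata;
      run;
    quit;
  %end;
%mend run_threads;

%run_threads()
```

Each iteration writes its own PROCEDURE DS2 timing note to the log, so the real and cpu times can be compared across thread counts.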
JacobSimonsen
Barite | Level 11

You are right: when the computational task is large relative to the I/O task, the gain from threaded processing can be huge even though the I/O is not threaded.

I tried the code you suggested, and I observe that the real (elapsed) time decreases a lot as the number of threads is increased.

When 8 threads are used:

NOTE: PROCEDURE DS2 used (Total process time):
      real time           2.40 seconds
      cpu time            17.50 seconds

When the ordinary datastep is used:

NOTE: There were 10000000 observations read from the data set BASE.JMASTER.
NOTE: The data set WORK.JOLD has 1 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           42.09 seconds
      cpu time            42.13 seconds
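To put those two logs side by side, the ratios can be worked out in a short DATA _NULL_ step. The variable names are assumptions for illustration; the numbers are copied from the logs above.

```sas
data _null_;
  /* Timings copied from the two logs above */
  ds2_real  = 2.40;   /* PROC DS2, 8 threads: real time */
  ds2_cpu   = 17.50;  /* PROC DS2, 8 threads: cpu time  */
  step_real = 42.09;  /* DATA step: real time           */

  speedup    = step_real / ds2_real;  /* elapsed-time ratio               */
  cores_busy = ds2_cpu / ds2_real;    /* average number of CPUs kept busy */

  put speedup= 8.2 cores_busy= 8.2;
run;
```

The elapsed-time ratio is about 17.5, and the cpu-to-real ratio is about 7.3, which is consistent with close to eight threads being busy. Since the ratio exceeds the thread count and the DS2 run also used less total cpu time (17.50 versus 42.13 seconds), threading alone probably does not account for the whole gain.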
