reecec
Fluorite | Level 6

Hi all,

I posted the same question on Stack Exchange, but thought I might get better help here.

 

I am trying to use PROC DS2 to get a performance gain over the normal data step by using its multithreading capability.
fred.testdata is an SPDE data set containing 5 million observations. My code is below:

 

proc ds2;
   thread home_claims_thread / overwrite = yes;
   /*declare char(10) producttype;
   declare char(12) wrknat_clmtype;
   declare char(7) claimtypedet;
   declare char(1) event_flag;*/
   /*declare date week_ending having format date9.;*/
   method run();
      /*declare char(7) _week_ending;*/
      set fred.testdata;
      if claim = 'X' then claimtypedet= 'ABC';
      else if claim = 'Y' then claimtypedet= 'DEF';
      /*_week_ending = COMPRESS(exposmth,'M');
    week_ending = to_date(substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01');*/
   end;
   endthread;

data home_claims / overwrite = yes;
   declare thread home_claims_thread t; 
   method run();
      set from t threads=8;
   end;
enddata;
run;
quit;

 

I included only a few of the IF statements, as the full set would take up several pages, but you should get the idea. As it stands, the code runs considerably faster than the normal data step. However, significant performance problems appear when either of the following happens:

  1. I uncomment any of the declare statements
  2. I include any numeric variables in fred.testdata (even without performing any calculations on the numeric variables)

My questions are:

  1. Is there any way to introduce numeric variables into fred.testdata without a slowdown that makes DS2 far slower than the normal data step? For this small table of 5 million rows, with numeric columns included, the real time is about 1 min 30 s for DS2 versus 20 seconds for the normal data step. DS2 WITHOUT the declare statements and numeric variables takes about 7 seconds. The actual full table is closer to 600 million rows. For example, I would like to be able to do the week_ending conversion without a 5x penalty in run time. I've noticed in "nmon" that as soon as I uncomment the week_ending logic the step drops back to using only 1 thread, and as soon as I comment it out again it goes back up to using the full 8 threads.
  2. Is there any way to compress the table in ds2 without having to do an additional data step to compress it?

 

Thank you

ChrisNZ
Tourmaline | Level 20

Out of curiosity, how many paths in your SPDE library definition?

Regarding compression,

data HOME_CLAIMS(compress=yes) / overwrite = yes;

should work.

reecec
Fluorite | Level 6

Hi ChrisNZ,

The compression option worked, thanks. As for paths, my libname is as follows:

 

LIBNAME fred SPDE '/work/saswork/fred';

 

I'm guessing this counts as just one path?

 

Regards,

Reece

ChrisNZ
Tourmaline | Level 20

Yes that's just one path.

Multiple threads typically need multiple I/O subsystems to improve performance.

Multiple threads hitting one disk just make random access requests instead of sequential requests, as all threads concurrently want a different piece of the file.

Unless your process is CPU-bound, which is the exception, you're typically better off reading sequentially from one disk.

If you have a large number of complex tests, CPU may well be the bottleneck, but you have to be sure, and then you must optimise the number of threads in order to ensure you don't create a new (much worse) bottleneck on the I/Os with too many threads.
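To make the point about paths concrete, here is a minimal sketch of an SPDE library spread over several data paths with the DATAPATH= option. The /disk1 to /disk3 directories are hypothetical and are not from the original post. The sketch only helps if each directory really sits on a separate physical device or I/O channel.

```sas
/* Sketch only: /disk1, /disk2 and /disk3 are hypothetical directories,   */
/* assumed to be on separate physical disks or I/O channels.              */
/* The primary path holds the metadata; DATAPATH= lists where the data    */
/* partitions are written.                                                */
libname fred spde '/work/saswork/fred'
   datapath=('/disk1/fred' '/disk2/fred' '/disk3/fred');
```

An existing table would most likely need to be rewritten into the library after this change, since the partitions are placed when the data set is created.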

 

Regarding the slowdown when adding a numeric variable, or when using some functions, I don't have enough experience with DS2 to comment. What is for sure is that the data step functions have been around for some time and have had time to be optimised. The DS2 functions are newer and may be in a rougher state. It would be sad if they were demonstrably much slower though.

 

 

 

reecec
Fluorite | Level 6

Thanks for your insights. I think I might stick with the normal data step at this stage.

 

In the original code the real time and CPU time were almost identical, which is why I thought the step was CPU-bound.
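One way to look at the real time versus CPU time comparison in more detail is the FULLSTIMER system option, which splits CPU time into user and system components in the log. The sketch below reuses fred.testdata and the two IF statements from the first post. The output table name work.home_claims_check is made up for illustration.

```sas
/* Log real time plus user and system CPU time for every step. */
options fullstimer;

/* Single-threaded baseline; work.home_claims_check is a hypothetical name. */
data work.home_claims_check;
   set fred.testdata;
   if claim = 'X' then claimtypedet = 'ABC';
   else if claim = 'Y' then claimtypedet = 'DEF';
run;
```

In a single-threaded step, real time close to CPU time suggests the step is CPU-bound, while real time well above CPU time suggests it is waiting on I/O. In a multithreaded step, CPU time can exceed real time because it is summed across threads.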

 

Interestingly I tried the following sample code

 

proc ds2;
   thread home_claims_thread / overwrite = yes;
   method run();
      set fred.base_home_exposure_mth;

      %ifstatements;

      _week_ending = COMPRESS(exposmth,'M');
      week_ending = substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01';
      week_ending2 = to_date(week_ending);
   end;
   endthread;

data home_claims (overwrite=yes) / compress = no ;
   declare thread home_claims_thread t;
   method run();
      set from t threads=8;
   end;
enddata;
run;
quit;

 

If I run the above code on the 5 million row sample data, it takes real time = 1:29.76 and CPU time = 2:16.66.

If I run the above code with week_ending2 = to_date(week_ending); commented out, it takes real time = 7.54 seconds and CPU time = 34 seconds.

 

The same code using the normal data step takes 19 seconds for both real time and CPU time.
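For reference, the month conversion in an ordinary data step could be sketched roughly as below. This assumes exposmth holds values shaped like 2015M03, which the substr positions in the DS2 code imply but the thread never shows. The sample rows and the work.sample and work.converted table names are invented for illustration.

```sas
/* Hypothetical sample rows; the real EXPOSMTH values are not shown in the thread. */
data work.sample;
   length exposmth $7;
   input exposmth $;
   datalines;
2015M03
2016M11
;
run;

data work.converted;
   set work.sample;
   /* Strip the 'M' to leave yyyymm, then read it with the YYMMN6. informat, */
   /* which returns the first day of that month as a SAS date.               */
   week_ending = input(compress(exposmth, 'M'), yymmn6.);
   format week_ending date9.;
run;
```

Reading the string with an informat avoids building an intermediate 'yyyy-mm-01' string, though whether that changes the run time on the real table would need to be measured.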

ChrisNZ
Tourmaline | Level 20

Well it does seem that you are CPU-bound.

 

However, DS2 seems to have a lot of overhead compared to DS, and now you shine the light on other sub-optimal "features".

So it seems that you are back to just standard optimisation of the data step for now. 😞

 

If you are interested, here is a link to a similar discussion about DS2, and another link discussing the speed gains that SPDE brings.

 

 

 

 

reecec
Fluorite | Level 6

Yep, looks like it's back to the original data step.

 

After reading one of the threads you posted, I kind of agree with Ksharp's verdict on DS2. I had high hopes but am quite disappointed with it (limited, not very user friendly, clunky to code in).

ChrisNZ
Tourmaline | Level 20

The times you mention are really slow. 

 

My data step runs in 3 seconds for 5 million rows.

 

I could replicate your issue with the numeric variable slowing things down in SPDE, but the declare statement being present makes no difference here.

 

 

libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary;

%macro loop;
  %local i j hide_i hide_dcl;
  %do i=1 %to 2;
    %let hide_i=%sysfunc(ifc(&i=1,,*));
    data  SPEEDY.TESTDATA;   length A1-A50 $8;
      CLAIM = 'X'; do I=1 to 5e6; output; &hide_i. drop I; end;
    run;

    %do j=1 %to 4;
      %let hide_dcl=%sysfunc(ifc(&j>2,,*));

      %put =============== %sysfunc(ifc(&i=1,NUM,No NUM)) - %sysfunc(ifc(&j>2,DCL,No DCL)) ================== ;
      proc ds2;
        thread home_claims_thread / overwrite = yes;
        &hide_dcl. declare char(7) claimtypedet;
        method run();
           set SPEEDY.TESTDATA;
           if      CLAIM = 'X' then CLAIMTYPEDET= 'ABC';
           else if CLAIM = 'Y' then CLAIMTYPEDET= 'DEF';
        end;
        endthread;

       data HOME_CLAIMS (compress=yes)/ overwrite = yes;
          declare thread home_claims_thread t; 
          method run();
             set from t threads=8;
          end;
       enddata;
       run;
      quit;
                      
      data HOME_CLAIMS(compress=yes); 
        set   SPEEDY.TESTDATA;
        if      CLAIM = 'X' then CLAIMTYPEDET= 'ABC';
        else if CLAIM = 'Y' then CLAIMTYPEDET= 'DEF';
      run;

    %end;
  %end;
%mend;
%loop;

 

Real time/ CPU time in seconds:

 

              Num       No Num
DS2 + dcl     12/18     7/9
DS2 No dcl    13/18     9/9
DS            3.5/3.5   3.5/3.5

 

My data step is always much faster as I only use one path, so multi-threading actually slows things down.

 
