Hi all,
I posted the same question on Stack Exchange but thought I might get better help here.
I was trying to use PROC DS2 to get a performance gain over the normal data step by using its multithreading capability.
fred.testdata is an SPDE dataset containing 5 million observations. My code is below:
proc ds2;
  thread home_claims_thread / overwrite = yes;
    /*declare char(10) producttype;
    declare char(12) wrknat_clmtype;
    declare char(7) claimtypedet;
    declare char(1) event_flag;*/
    /*declare date week_ending having format date9.;*/
    method run();
      /*declare char(7) _week_ending;*/
      set fred.testdata;
      if claim = 'X' then claimtypedet = 'ABC';
      else if claim = 'Y' then claimtypedet = 'DEF';
      /*_week_ending = COMPRESS(exposmth,'M');
      week_ending = to_date(substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01');*/
    end;
  endthread;
  data home_claims / overwrite = yes;
    declare thread home_claims_thread t;
    method run();
      set from t threads=8;
    end;
  enddata;
run;
quit;
I didn't include all the IF statements, only a few, otherwise they would have taken up several pages (hopefully you get the idea). As it stands, the code runs a fair bit faster than the normal data step; however, significant performance issues arise when any of the following happens:
My questions are:
Thank you
Out of curiosity, how many paths in your SPDE library definition?
Regarding compression,
data HOME_CLAIMS(compress=yes) / overwrite = yes;
should work.
Hi ChrisNZ,
That compression option worked, thanks. As for paths, my libname is as follows:
LIBNAME fred SPDE '/work/saswork/fred';
I'm guessing this counts as just one path?
Regards,
Reece
Yes, that's just one path.
Multiple threads typically need multiple I/O subsystems to improve performance.
Multiple threads hitting one disk just make random access requests instead of sequential requests, as all threads concurrently want a different piece of the file.
Unless your process is CPU-bound, which is the exception, you're typically better off reading sequentially from one disk.
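As a rough illustration of what multiple paths could look like, an SPDE libname can spread its data partitions over several directories with the DATAPATH= option. The /disk1 to /disk4 mount points below are hypothetical, and they only help if they sit on separate physical devices or controllers.

```sas
/* Sketch only: /disk1 ... /disk4 are made-up mount points.                */
/* The primary path holds the metadata; DATAPATH= lists where the data     */
/* partitions are written, INDEXPATH= where any index files go.            */
libname fred spde '/work/saswork/fred'
    datapath=('/disk1/fred' '/disk2/fred' '/disk3/fred')
    indexpath=('/disk4/fred');
```

With a layout like this, concurrent threads have a chance of hitting different devices instead of competing for one.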
If you have a large number of complex tests, CPU may well be the bottleneck, but you have to be sure, and then you must optimise the number of threads in order to ensure you don't create a new (much worse) bottleneck on the I/Os with too many threads.
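One way to look for the best thread count is simply to time the same DS2 data program at several values of threads=. The sketch below assumes the thread program home_claims_thread from the original post has already been compiled and stored; the macro name try_threads and the list of counts are made up for illustration.

```sas
options fullstimer;   /* report real, user CPU and system CPU time in the log */

%macro try_threads(counts=1 2 4 8);
  %local k n;
  %do k = 1 %to %sysfunc(countw(&counts));
    %let n = %scan(&counts, &k);
    %put NOTE: ===== threads=&n =====;
    proc ds2;
      data home_claims / overwrite = yes;
        /* assumes home_claims_thread was compiled in an earlier PROC DS2 */
        declare thread home_claims_thread t;
        method run();
          set from t threads=&n;
        end;
      enddata;
    run;
    quit;
  %end;
%mend try_threads;

%try_threads()
```

If real time stops improving (or gets worse) as the count grows while CPU time keeps climbing, the extra threads are probably just fighting over I/O.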
Regarding the slowdown when adding a numeric variable, or when using some functions, I don't have enough experience with DS2 to comment. What is certain is that the data step functions have been around for some time and have had time to be optimised. The DS2 functions are newer and may be in a rougher state. It would be sad if they were demonstrably much slower though.
Thanks for your insights. I think I might stick with the normal data step at this stage.
In the original code the real time and CPU time were almost identical, which is why I thought the step was CPU-bound.
Interestingly, I tried the following sample code:
proc ds2 ;
  thread home_claims_thread / overwrite = yes;
    method run();
      set fred.base_home_exposure_mth;
      %ifstatements;
      _week_ending = COMPRESS(exposmth,'M');
      week_ending = substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01';
      week_ending2 = to_date(week_ending);
    end;
  endthread;
  data home_claims (overwrite=yes) / compress = no ;
    declare thread home_claims_thread t;
    method run();
      set from t threads=8;
    end;
  enddata;
run;
quit;
If I run the above code on the 5 million row sample data, it takes real time = 1:29.76 and CPU time = 2:16.66.
If I run the above code with week_ending2 = to_date(week_ending); commented out, it takes real time = 7.54 seconds and CPU time = 34 seconds.
The same code using the normal data step takes 19 seconds for both real time and CPU time.
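For comparison, here is a minimal sketch of how the same date could be derived in a plain data step. It assumes exposmth holds values like '2013M05' (inferred from the substr positions above, not confirmed), and it is not necessarily what the 19-second data step actually did.

```sas
data home_claims;
  set fred.base_home_exposure_mth;
  /* Assumption: exposmth looks like '2013M05'. Removing the 'M' leaves */
  /* '201305', which the YYMMN6. informat reads as the first day of     */
  /* that month.                                                        */
  week_ending = input(compress(exposmth, 'M'), yymmn6.);
  format week_ending date9.;
run;
```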
Well it does seem that you are CPU-bound.
However, DS2 seems to have a lot of overhead compared to DS, and now you have shone a light on other sub-optimal "features".
So it seems that you are back to just standard optimisation of the data step for now. 😞
If you are interested, here is a link to a similar discussion about DS2, and another link discussing the speed gains that SPDE brings.
Yep, looks like it's back to the original data step.
After reading one of those threads you posted, I kind of agree with Ksharp's verdict on DS2: I had high hopes but am quite disappointed with it (limited, not very user-friendly, clunky to code in).
The times you mention are really slow.
My data step runs in 3 seconds for 5 million rows.
I could replicate your issue with the numeric variable slowing things down in SPDE, but the declare statement being present makes no difference here.
libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary;
%macro loop;
  %local i j hide_i hide_dcl;
  %do i=1 %to 2;
    %let hide_i=%sysfunc(ifc(&i=1,,*));
    data SPEEDY.TESTDATA;
      length A1-A50 $8;
      CLAIM = 'X';
      do I=1 to 5e6; output; &hide_i. drop I; end;
    run;
    %do j=1 %to 4;
      %let hide_dcl=%sysfunc(ifc(&j>2,,*));
      %put =============== %sysfunc(ifc(&i=1,NUM,No NUM)) - %sysfunc(ifc(&j>2,DCL,No DCL)) ================== ;
      proc ds2;
        thread home_claims_thread / overwrite = yes;
          &hide_dcl. declare char(7) claimtypedet;
          method run();
            set SPEEDY.TESTDATA;
            if CLAIM = 'X' then CLAIMTYPEDET= 'ABC';
            else if CLAIM = 'Y' then CLAIMTYPEDET= 'DEF';
          end;
        endthread;
        data HOME_CLAIMS (compress=yes)/ overwrite = yes;
          declare thread home_claims_thread t;
          method run();
            set from t threads=8;
          end;
        enddata;
      run;
      quit;
      data HOME_CLAIMS(compress=yes);
        set SPEEDY.TESTDATA;
        if CLAIM = 'X' then CLAIMTYPEDET= 'ABC';
        else if CLAIM = 'Y' then CLAIMTYPEDET= 'DEF';
      run;
    %end;
  %end;
%mend;
%loop;
Real time/ CPU time in seconds:
           | Num     | No Num
DS2 + dcl  | 12/18   | 7/9
DS2 No dcl | 13/18   | 9/9
DS         | 3.5/3.5 | 3.5/3.5
My data step is always much faster as I only use one path, so multi-threading actually slows things down.