Interesting. This is the log when I ran your test on my server:
35   data _null_;   * 7s CPU/Elapse;
36     set TEST;
37     PAY_LAG1 = lag (PAYMENT);
38     PAY_LAG2 = lag2(PAYMENT);
39   run;

NOTE: There were 100000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           44.22 seconds
      cpu time            5.75 seconds

40
41   data _null_;   * 5s CPU/Elapse;

                              2    Das SAS System    08:26 Thursday, October 25, 2018

42     retain PAY_LAG1;
43     PAY_LAG2 = PAY_LAG1;
44     PAY_LAG1 = PAYMENT;
45     set TEST;
46   run;

NOTE: There were 100000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           46.69 seconds
      cpu time            4.88 seconds
SAS 9.4M5, AIX 7.1, 2 POWER8 cores, MEMSIZE 256M (that's why SASFILE won't work).
All this ran with a fairly constant run queue of around 8, and you can see that the potential gain in real time is negligible (in this particular test there was none), because the I/O factor is much bigger.
After reducing the observation count to 1e7, SASFILE became possible, and I got this:
35   data _null_;   * 7s CPU/Elapse;
36     set TEST;
37     PAY_LAG1 = lag (PAYMENT);
38     PAY_LAG2 = lag2(PAYMENT);
39   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           3.97 seconds
      cpu time            0.54 seconds

40
41   data _null_;   * 5s CPU/Elapse;
42     retain PAY_LAG1;
43     PAY_LAG2 = PAY_LAG1;
44     PAY_LAG1 = PAYMENT;
45     set TEST;
46   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           3.53 seconds
      cpu time            0.44 seconds
Once again, the difference in CPU seconds lies in the 10-15% range.
What environment do you use for SAS?
I am testing under Windows. I can post the log tomorrow, but I don't think there is anything to be learned from it.
>as the factor I/O is much bigger
The idea is to test whether retrieving previous values with LAG() is slower than saving values before they are overwritten by SET.
It may well be that, depending on other parameters, this difference in efficiency makes no difference in various cases.
That is beside the point. That's why I used data _null_ and SASFILE: to isolate the process we are measuring.
Once again, the difference in CPU seconds lies in the 10-15% range.
Your numbers show CPU-time increases of 17% and 22% when using LAG. Not as high as in my test (maybe due to the heavy load on your server?), but enough to conclude that LAG is less efficient, imho.
MEMSIZE 256M
Oh wow that's low! So much CPU power so little RAM space!
@ChrisNZ wrote:
MEMSIZE 256M
Oh wow that's low! So much CPU power so little RAM space!
That follows a recommendation from SAS themselves. If you allow SAS lots of memory in a heavily multi-user setup, SAS will (in most cases, with data steps and "simple" procedures that do not need much memory themselves) "waste" memory by caching SAS dataset data internally, even though the operating system already does that on its own. In the extreme, the OS would then have to page out program pages that actually contain data which is already present in the portion the OS reserves for caching disk data. Double whammy. Restricting SAS memory prevents that and makes the most of the operating system's capabilities in terms of read-ahead, sharing cached data, and the like.
So we work with the least memory that's necessary for our codes to work.
At the time I ran my tests, our monthly closure was running, so the accountants and actuaries were riding the server hardest, pushing up the run queue and the I/O load.
The batch programs I develop are also meant to run in parallel, so using memory economically is also a must.
Remember, in this application there are several variables that need to be lagged 5 times. This might be a more appropriate test of the savings:
data want1;
  set have;
  pay_lag1 = lag(payment);
  pay_lag2 = lag2(payment);
  pay_lag3 = lag3(payment);
  pay_lag4 = lag4(payment);
  pay_lag5 = lag5(payment);
run;
data want2;
  pay_lag5 = pay_lag4;
  pay_lag4 = pay_lag3;
  pay_lag3 = pay_lag2;
  pay_lag2 = pay_lag1;
  pay_lag1 = payment;
  set have;
  retain pay_lag1-pay_lag5;
run;
Yes, I know I don't need to retain PAY_LAG5. But that can only help the program run a hair faster.
With 5 lags vs. 5 retains, the difference becomes significant. With 3e7 observations I got
real time 3.83 seconds cpu time 2.13 seconds
for the lag() vs.
real time 2.32 seconds cpu time 1.29 seconds
for the retain version. All tests ran from a dataset loaded with sasfile and used in a data _null_ step, so I/O was mostly removed as a factor.
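For reference, the test harness sketched below shows how such a benchmark can be set up with SASFILE; the dataset name WORK.HAVE matches the steps above, but the exact setup I used may have differed in details:

```sas
/* Sketch: pin the data set in memory so the timed steps    */
/* read from RAM instead of disk (needs enough MEMSIZE).    */
sasfile work.have load;

data _null_;              /* timed step, no output dataset  */
  set have;
  pay_lag1 = lag(payment);
run;

sasfile work.have close;  /* release the buffered pages     */
```

With the data pinned by SASFILE and data _null_ suppressing output, the remaining CPU time is almost entirely the statement overhead being compared.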
I hereby revoke my former statement about SAS functions being optimized.
I then added a step with a temporary array:
data _null_;
  set have;
  array pay_lag{5} _temporary_;
  do i = 1 to 4;
    pay_lag{i} = pay_lag{i+1};
  end;
  pay_lag{5} = payment;
run;
and this turned out to be the slowest of all, more than double the CPU time of the retain version.
Who's the culprit here? It turns out to be the DO loop, because (after digging out the old C programmer in me, who uses #defines to speed up the compiled code)
data _null_;
  set have;
  array pay_lag{5} _temporary_;
  %macro mymac;
    %do i = 1 %to 4;
      pay_lag{&i} = pay_lag{%eval(&i+1)};
    %end;
  %mend;
  %mymac
  pay_lag{5} = payment;
run;
performed as well as the retain version.
Bottom line: if you engage in optimizing code in this way, creating lots of statements with a macro can save significantly in runtime.
1. if you engage in optimizing code in this way, creating lots of statements with a macro can save significantly in runtime.
Top notch optimisation research here Kurt. Well done! 🙂
2. SAS will "waste" memory by caching SAS dataset data internally, when the operating system already does that on its own
- I've never heard of this before. As far as I know, a step allocates virtual memory based on best guess upon starting, requests more as/if needed until the value of ~MEMSIZE is reached, and then uses utility files if that's a feature supported by that step. At the end of the step, all the virtual memory is released.
- The latest default MEMSIZE value seems to be a more reasonable 2G (it was 512M in version 9.3).
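For anyone who wants to check what their own installation is running with, a quick sketch:

```sas
/* Print the current MEMSIZE setting to the log. */
proc options option=memsize;
run;
```

MEMSIZE itself can only be set at invocation (command line or config file), not changed mid-session.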
I think the temporary array idea can work, and I do not think you have to initialize the array for each account. Just keep track of the number of records you have read from that account, something like
data want;
  array dpd_lag(100) 8 _temporary_;
  array pay_lag(100) 8 _temporary_;
  array prin_lag(100) 8 _temporary_;
  do _N_ = 1 by 1 until (last.account_number);
    set have;
    by account_number;
    /* assuming that you want to do the calculation for loan_count=1 as well,
       with no lag, I put the current variables in also */
    dpd_lag(_N_) = dpd;
    pay_lag(_N_) = payment_amount;
    prin_lag(_N_) = payment;
    if loan_count > _N_ then /* would be the same as getting the lag from a previous account */
      error 'Wrong loan count';
    else do;
      lag_index = _N_ - loan_count + 1;
      /* do the calculation here, using dpd_lag(lag_index) etc. as the lagged vars */
      /* I assume the calculation is basically the same for all loan counts */
      output;
    end;
  end;
run;
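To try the step out, a hypothetical toy input can be generated like this; the variable names match the step above, but the value ranges are made up:

```sas
/* Sketch: two accounts, five records each, loan_count    */
/* increasing within each account as the step expects.    */
data have;
  do account_number = 1 to 2;
    do loan_count = 1 to 5;
      dpd = int(ranuni(1) * 30);                  /* days past due   */
      payment_amount = round(ranuni(1) * 1000, .01);
      payment = payment_amount;                   /* principal part  */
      output;
    end;
  end;
run;
```

Because the DOW loop restarts _N_ at 1 for every BY group, the arrays never need re-initializing between accounts; stale slots from the previous account are simply overwritten before they can be read.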