I'm reworking a program that has a data set with about 400 million records sorted by account number with one account number per time period. There are about 300 time periods but each account can be in the data set only maybe 20 times.
We have to process by account_number because we need to derive running totals for each account number across all time periods.
Some derived variables requires lagging. Here's the existing code-. For each record a series of lagged variables are created and then referenced in the code later.
dpd_lag1=lag(dpd); pay_lag1=lag(payment_amount); prin_lag1=lag(payment);
dpd_lag2=lag2(dpd); pay_lag2=lag2(payment_amount); prin_lag2=lag2(payment);
dpd_lag3=lag3(dpd); pay_lag3=lag3(payment_amount); prin_lag3=lag3(payment);
dpd_lag4=lag4(dpd); pay_lag4=lag4(payment_amount); prin_lag4=lag4(payment);
dpd_lag5=lag5(dpd); pay_lag5=lag5(payment_amount); prin_lag5=lag5(payment);
if loan_count=2 then do;
max_dpd=max(dpd,dpd_lag1);
max_dpd_2=dpd_lag1;
if dq_monthly_payment > 0 then do;
chrva421=(100*payment)/dq_monthly_payment;
chrva421p=(100*sum(payment_amount,pay_lag1))/(dq_monthly_payment*2);
end;
else do;
chrva421=.;
chrva421p=.;
end;
end;
else if loan_count=3 then do;
max_dpd=max(dpd,dpd_lag1,dpd_lag2);
max_dpd_2=max(dpd_lag1,dpd_lag2);
if dq_monthly_payment > 0 then do;
chrva421=(100*sum(payment,prin_lag1))/(dq_monthly_payment*2);
chrva421p=(100*sum(payment_amount,pay_lag1,pay_lag2))/(dq_monthly_payment*3);
end;
else do;
chrva421=.;
chrva421p=.;
end;
end;
** continues up to loan_count=6 ;
The program takes Hours to run on our server.
My question is, do you think it would be faster to lag the variables 'on the fly' instead of creating them as we're doing here? We'd remove the lag assignment statements and the code would look like this with LAG functions used on-the-fly. (may be parenthesis issues in here.)
if loan_count=2 then do;
max_dpd=max(dpd,lag(dpd));
max_dpd_2=lag(dpd);
if dq_monthly_payment > 0 then do;
chrva421=(100*payment)/dq_monthly_payment;
chrva421p=(100*sum(payment_amount,lag(payment_amount))/(dq_monthly_payment*2);
end;
else do;
chrva421=.;
chrva421p=.;
end;
end;
else if loan_count=3 then do;
max_dpd=max(dpd,lag(dpd),lag2(dpd));
max_dpd_2=max(dpd_lag1,dpd_lag2);
if dq_monthly_payment > 0 then do;
chrva421=(100*sum(payment,lag(payment))/(dq_monthly_payment*2);
chrva421p=(100*sum(payment_amount,lag(payment),lag2(payment))/(dq_monthly_payment*3);
end;
else do;
chrva421=.;
chrva421p=.;
end;
end;
I'm also thinking of putting this into a macro do-loop.
Thanks in advance for your thoughts.
John
I've thought of using an array but figured i'd ask for opinions on this solution first. The issue with a _temporary_ array is that the values are retained-- "Temporary data element values are always automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step." http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000201956.htm
I need the array to be refreshed/reloaded with each iteration of the DATA step because an account can have more than 5 records. So would anything be gained from an array versus assignment statements?
To counter -The issue with a _temporary_ array is that the values are retained-- "Temporary data element values are always automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step."
array temp(9999) _temporary_;
call missing(of temp(*));
at the top of each iteration
Brilliant. Thanks. I might go with the temporary array solution... it's got to be faster than adding 15 variables to each record and then dropping them at the end of the step.
Yes Sir. i concur with @Reeza 's idea too as no output buffer is created and the variables are contiguous and so easy to peek/compute. Let us know if you need any further help.
Thanks to Everyone who replied. As usual, I'm immensely grateful for the intelligent responses I received.
My largest concern was that for each record we were creating 15 lagged variables whether or not they were needed for the calculations--the number of needed lagged variables is based on the loan_count--if loan_count=2 then we need 2 lagged variables and only if it is more than 6 do we need all 15. As a short-term fix until I can further investigate the temporary array approach I'm using this code that lags only the fields needed for the calculation.
if loan_count = 2 then do;
** here create only the lagged variables we need. ;
dpd_lag1 = lag(dpd); pay_lag1 = lag(payment_amount);
max_dpd = max(dpd,dpd_lag1);
max_dpd_2 = dpd_lag1;
if dq_monthly_payment > 0 then do;
CHRVA421 = (100 * payment) / dq_monthly_payment;
CHRVA421P = (100 * sum(payment_amount,pay_lag1)) / (dq_monthly_payment * 2);
end;
end;
else if loan_count = 4 then do;
dp_lag1 = lag(dpd); pay_lag1 = lag(payment_amount); prin_lag1 = lag(payment);
dpd_lag2 = lag2(dpd); pay_lag2 = lag2(payment_amount); prin_lag2 = lag2(payment);
dpd_lag3 = lag3(dpd); pay_lag3 = lag3(payment_amount);
max_dpd = max(dpd,dpd_lag1,dpd_lag2,dpd_lag3);
max_dpd_2 = max(dpd_lag1,dpd_lag2,dpd_lag3);
if dq_monthly_payment > 0 then do;
CHRVA421 = (100 * sum(payment,prin_lag1,prin_lag2)) / (dq_monthly_payment * 3);
CHRVA421P = (100 * sum(payment_amount,pay_lag1,pay_lag2,pay_lag3))/(dq_monthly_payment * 4);
end;
end;
else if loan_count > 6 then do;
dpd_lag1 = lag(dpd); pay_lag1 = lag(payment_amount); prin_lag1 = lag(payment);
dpd_lag2 = lag2(dpd); pay_lag2 = lag2(payment_amount); prin_lag2 = lag2(payment);
dpd_lag3 = lag3(dpd); pay_lag3 = lag3(payment_amount); prin_lag3 = lag3(payment);
dpd_lag4 = lag4(dpd); pay_lag4 = lag4(payment_amount); prin_lag4 = lag4(payment);
dpd_lag5 = lag5(dpd); pay_lag5 = lag5(payment_amount); prin_lag5 = lag5(payment);
max_dpd = max(dpd,dpd_lag1,dpd_lag2,dpd_lag3,dpd_lag4,dpd_lag5);
max_dpd_2 = max(dpd_lag1,dpd_lag2,dpd_lag3,dpd_lag4,dpd_lag5);
if dq_monthly_payment > 0 then do;
CHRVA421 = (100 * sum(payment,prin_lag1,prin_lag2,prin_lag3,prin_lag4,prin_lag5))/(dq_monthly_payment*6);
CHRVA421P = (100 * sum(payment_amount,pay_lag1,pay_lag2,pay_lag3,pay_lag4,pay_lag5))/(dq_monthly_payment*6);
end;
end;
If your data was sorted by descending loan_count you slowly decrease the number of lags to be performed as you no longer need as many. Note that saving values before the SET statement was determined to be more efficient than saving them via the LAG() function.
Your code will not work.
The lag5 stack will only be updated when loan_count > 6, so will fall out of sync.
data HAVE;
do I=1 to 15;
output;
end;
run;
data WANT;
set HAVE;
J=lag(I);
if mod(I,2) then K=lag(I);
if mod(I,2) then L=lag2(I);
run;
I | J | K | L |
---|---|---|---|
1 | . | . | . |
2 | 1 | . | . |
3 | 2 | 1 | . |
4 | 3 | . | . |
5 | 4 | 3 | 1 |
6 | 5 | . | . |
7 | 6 | 5 | 3 |
8 | 7 | . | . |
9 | 8 | 7 | 5 |
10 | 9 | . | . |
11 | 10 | 9 | 7 |
12 | 11 | . | . |
13 | 12 | 11 | 9 |
14 | 13 | . | . |
15 | 14 | 13 | 11 |
How did you determine the LAG functions are causing the performance issue?
If you are executing the LAG or any function unnecessarily then not doing should improve performance.
Will the LAGs be right with when the are executed as you suggest? i.e. the conditional LAG problem.
1. Arrays are a good option to explore
2. Your suggestion
if loan_count=2 then do;
max_dpd=max(dpd,lag(dpd));
would not work. The LAG stack must be updated with each observation. The IF test prevents that.
3. I don't think you need
else do;
chrva421=.;
chrva421p=.;
end;
4. Temporary arrays are much faster than named ones.
One more thought ... it could be that the LAG function is slowing the program. Note that these two programs generate identical resrults:
data want1;
set have;
pay_lag1 = lag(payment);
pay_lag2 = lag2(payment);
run;
data want2;
pay_lag2 = pay_lag1;
pay_lag1 = payment;
set have;
retain pay_lag1;
run;
It's less work to do this with numeric variables. With character variables, you would have to set the length first. But it could well be significantly faster than using LAG.
Maybe two thoughts that come to my mind to complement what have been said:
- it looks like you only need to determine the n-1 lag of the current loan_Count if-clause
- You may want to determine if you can make use of the SASFILE Statement, maybe by exploring if you can split up the huge dataset upon performing the operations by time period and account number.
- Cheers -
@Astounding wrote:
One more thought ... it could be that the LAG function is slowing the program. Note that these two programs generate identical resrults:
data want1;
set have;
pay_lag1 = lag(payment);
pay_lag2 = lag2(payment);
run;
data want2;
pay_lag2 = pay_lag1;
pay_lag1 = payment;
set have;
retain pay_lag1;
run;
It's less work to do this with numeric variables. With character variables, you would have to set the length first. But it could well be significantly faster than using LAG.
I just ran a similar test against one of our larger datasets (100+ million obs, ~14 G compressed size):
data ww0.test2 (keep=x1 x2);
set source;
length x1 x2 4;
x1 = invar;
x2 = lag(invar);
run;
data ww0.test3 (keep=x1 x2);
set source;
length x1 x2 4;
retain x2;
x1 = invar;
output;
x2 = invar;
run;
In terms of CPU time, both methods performed similar, with only marginal differences. Outside factors (concurrent load, especially I/O) was much more significant with regards to real time needed.
My personal bottom line: I expect a Base SAS function (and one that's been around as long as lag()) to be optimized as much as can be, and not to be outdone by manually written code that does essentially the same (fill and move a FIFO chain)
My experience differs, using LAG() takes 40% longer.
data TEST;
do PAYMENT = 1 to 1e8;
output;
end;
run;
sasfile TEST load;
data _null_; * 7s CPU/Elapse;
set TEST;
PAY_LAG1 = lag (PAYMENT);
PAY_LAG2 = lag2(PAYMENT);
run;
data _null_; * 5s CPU/Elapse;
retain PAY_LAG1;
PAY_LAG2 = PAY_LAG1;
PAY_LAG1 = PAYMENT;
set TEST;
run;
sasfile TEST close;
> I expect a Base SAS function (and one that's been around as long as lag()) to be optimized as much as can be
Considering how badly functions perform when used in WHERE clauses compared to IF clauses, my optimism regarding optimisation doesn't reach the levels of yours. 🙂
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.