Solved: Lag performance over 400 million sorted records

bentleyj1 · Posted 10-23-2018 05:35 PM

I'm reworking a program that has a data set with about 400 million records sorted by account number with one account number per time period. There are about 300 time periods but each account can be in the data set only maybe 20 times.

We have to process by account_number because we need to derive running totals for each account number across all time periods.
Some derived variables requires lagging. Here's the existing code-. For each record a series of lagged variables are created and then referenced in the code later.

dpd_lag1=lag(dpd); pay_lag1=lag(payment_amount); prin_lag1=lag(payment);
dpd_lag2=lag2(dpd); pay_lag2=lag2(payment_amount); prin_lag2=lag2(payment);
dpd_lag3=lag3(dpd); pay_lag3=lag3(payment_amount); prin_lag3=lag3(payment);
dpd_lag4=lag4(dpd); pay_lag4=lag4(payment_amount); prin_lag4=lag4(payment);
dpd_lag5=lag5(dpd); pay_lag5=lag5(payment_amount); prin_lag5=lag5(payment);

if loan_count=2 then do;
max_dpd=max(dpd,dpd_lag1);
max_dpd_2=dpd_lag1;

        if dq_monthly_payment > 0 then do;
          chrva421=(100*payment)/dq_monthly_payment;
          chrva421p=(100*sum(payment_amount,pay_lag1))/(dq_monthly_payment*2);
       end;
       else do;
          chrva421=.;
          chrva421p=.;
      end;
    end;

else if loan_count=3 then do;
max_dpd=max(dpd,dpd_lag1,dpd_lag2);
max_dpd_2=max(dpd_lag1,dpd_lag2);

     if dq_monthly_payment > 0 then do;
         chrva421=(100*sum(payment,prin_lag1))/(dq_monthly_payment*2);
         chrva421p=(100*sum(payment_amount,pay_lag1,pay_lag2))/(dq_monthly_payment*3);
      end;
    else do;
      chrva421=.;
      chrva421p=.;
    end;
end;

** continues up to loan_count=6 ;

The program takes Hours to run on our server.

My question is, do you think it would be faster to lag the variables 'on the fly' instead of creating them as we're doing here? We'd remove the lag assignment statements and the code would look like this with LAG functions used on-the-fly. (may be parenthesis issues in here.)

if loan_count=2 then do;
max_dpd=max(dpd,lag(dpd));
max_dpd_2=lag(dpd);

       if dq_monthly_payment > 0 then do;
         chrva421=(100*payment)/dq_monthly_payment;
         chrva421p=(100*sum(payment_amount,lag(payment_amount))/(dq_monthly_payment*2);
       end;
       else do;
        chrva421=.;
        chrva421p=.;
      end;
    end;

else if loan_count=3 then do;
max_dpd=max(dpd,lag(dpd),lag2(dpd));
max_dpd_2=max(dpd_lag1,dpd_lag2);

    if dq_monthly_payment > 0 then do;
    chrva421=(100*sum(payment,lag(payment))/(dq_monthly_payment*2);
    chrva421p=(100*sum(payment_amount,lag(payment),lag2(payment))/(dq_monthly_payment*3);
    end;
    else do;
      chrva421=.;
      chrva421p=.;
    end;
end;

I'm also thinking of putting this into a macro do-loop.

Thanks in advance for your thoughts.

John

Reeza · Posted 10-23-2018 06:25 PM

To me, you'd need to reset it at the beginning of each new ID wouldn't you?If you're only going up to 5 though and that makes sense you can just overwrite the numbers can't you. You can easily set it to missing with CALL MISSING().

View solution in original post

Reeza · Posted 10-23-2018 05:39 PM

If your data is sorted, what about a temporary array instead? And if you know it's only twenty maximum per ID that's the limit of the size you'd need and SAS will drop them automatically.

bentleyj1 · Posted 10-23-2018 05:51 PM

I've thought of using an array but figured i'd ask for opinions on this solution first. The issue with a _temporary_ array is that the values are retained-- "Temporary data element values are always automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step." http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000201956.htm

I need the array to be refreshed/reloaded with each iteration of the DATA step because an account can have more than 5 records. So would anything be gained from an array versus assignment statements?

novinosrin · Posted 10-23-2018 06:02 PM

To counter -The issue with a _temporary_ array is that the values are retained-- "Temporary data element values are always automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step."

array temp(9999) _temporary_;

call missing(of temp(*));

at the top of each iteration

bentleyj1 · Posted 10-23-2018 06:05 PM

Brilliant. Thanks. I might go with the temporary array solution... it's got to be faster than adding 15 variables to each record and then dropping them at the end of the step.

novinosrin · Posted 10-23-2018 06:08 PM

Yes Sir. i concur with @Reeza 's idea too as no output buffer is created and the variables are contiguous and so easy to peek/compute. Let us know if you need any further help.

Reeza · Posted 10-23-2018 06:25 PM

To me, you'd need to reset it at the beginning of each new ID wouldn't you?If you're only going up to 5 though and that makes sense you can just overwrite the numbers can't you. You can easily set it to missing with CALL MISSING().

bentleyj1 · Posted 10-25-2018 05:31 PM

Thanks to Everyone who replied. As usual, I'm immensely grateful for the intelligent responses I received.

My largest concern was that for each record we were creating 15 lagged variables whether or not they were needed for the calculations--the number of needed lagged variables is based on the loan_count--if loan_count=2 then we need 2 lagged variables and only if it is more than 6 do we need all 15. As a short-term fix until I can further investigate the temporary array approach I'm using this code that lags only the fields needed for the calculation.

if loan_count = 2 then do;

** here create only the lagged variables we need. ;
dpd_lag1 = lag(dpd); pay_lag1 = lag(payment_amount);

      max_dpd = max(dpd,dpd_lag1);
max_dpd_2 = dpd_lag1;
      if dq_monthly_payment > 0 then do;
      CHRVA421 = (100 * payment) / dq_monthly_payment;
      CHRVA421P = (100 * sum(payment_amount,pay_lag1)) / (dq_monthly_payment * 2);
      end;
    end;

else if loan_count = 4 then do;
  dp_lag1 = lag(dpd); pay_lag1 = lag(payment_amount); prin_lag1 = lag(payment);
  dpd_lag2 = lag2(dpd); pay_lag2 = lag2(payment_amount); prin_lag2 = lag2(payment);
  dpd_lag3 = lag3(dpd); pay_lag3 = lag3(payment_amount);

      max_dpd = max(dpd,dpd_lag1,dpd_lag2,dpd_lag3);
max_dpd_2 = max(dpd_lag1,dpd_lag2,dpd_lag3);
      if dq_monthly_payment > 0 then do;
     CHRVA421 = (100 * sum(payment,prin_lag1,prin_lag2)) / (dq_monthly_payment * 3);
     CHRVA421P = (100 * sum(payment_amount,pay_lag1,pay_lag2,pay_lag3))/(dq_monthly_payment * 4);
      end;
end;

else if loan_count > 6 then do;
  dpd_lag1 = lag(dpd); pay_lag1 = lag(payment_amount); prin_lag1 = lag(payment);
  dpd_lag2 = lag2(dpd); pay_lag2 = lag2(payment_amount); prin_lag2 = lag2(payment);
  dpd_lag3 = lag3(dpd); pay_lag3 = lag3(payment_amount); prin_lag3 = lag3(payment);
  dpd_lag4 = lag4(dpd); pay_lag4 = lag4(payment_amount); prin_lag4 = lag4(payment);
  dpd_lag5 = lag5(dpd); pay_lag5 = lag5(payment_amount); prin_lag5 = lag5(payment);

   max_dpd = max(dpd,dpd_lag1,dpd_lag2,dpd_lag3,dpd_lag4,dpd_lag5);
   max_dpd_2 = max(dpd_lag1,dpd_lag2,dpd_lag3,dpd_lag4,dpd_lag5);
   if dq_monthly_payment > 0 then do;
     CHRVA421 = (100 * sum(payment,prin_lag1,prin_lag2,prin_lag3,prin_lag4,prin_lag5))/(dq_monthly_payment*6);
     CHRVA421P = (100 * sum(payment_amount,pay_lag1,pay_lag2,pay_lag3,pay_lag4,pay_lag5))/(dq_monthly_payment*6);
      end;

end;

ChrisNZ · Posted 10-25-2018 06:22 PM

If your data was sorted by descending loan_count you slowly decrease the number of lags to be performed as you no longer need as many. Note that saving values before the SET statement was determined to be more efficient than saving them via the LAG() function.

Your code will not work.

The lag5 stack will only be updated when loan_count > 6, so will fall out of sync.

data HAVE;
 do I=1 to 15;
   output; 
  end; 
run;
data WANT; 
  set HAVE;
  J=lag(I);
  if mod(I,2) then K=lag(I);
  if mod(I,2) then L=lag2(I);
run;

I	J	K	L
1	.	.	.
2	1	.	.
3	2	1	.
4	3	.	.
5	4	3	1
6	5	.	.
7	6	5	3
8	7	.	.
9	8	7	5
10	9	.	.
11	10	9	7
12	11	.	.
13	12	11	9
14	13	.	.
15	14	13	11

High-Performance SAS Coding - Third Edition

data_null__ · Posted 10-23-2018 06:15 PM

How did you determine the LAG functions are causing the performance issue?

If you are executing the LAG or any function unnecessarily then not doing should improve performance.

Will the LAGs be right with when the are executed as you suggest? i.e. the conditional LAG problem.

ChrisNZ · Posted 10-23-2018 08:58 PM

1. Arrays are a good option to explore

2. Your suggestion

if loan_count=2 then do;
max_dpd=max(dpd,lag(dpd));

would not work. The LAG stack must be updated with each observation. The IF test prevents that.

3. I don't think you need

else do;
        chrva421=.;
        chrva421p=.;
      end;

4. Temporary arrays are much faster than named ones.

High-Performance SAS Coding - Third Edition

Astounding · Posted 10-23-2018 11:17 PM

One more thought ... it could be that the LAG function is slowing the program. Note that these two programs generate identical resrults:

data want1;

set have;

pay_lag1 = lag(payment);

pay_lag2 = lag2(payment);

run;

data want2;

pay_lag2 = pay_lag1;

pay_lag1 = payment;

set have;

retain pay_lag1;

run;

It's less work to do this with numeric variables. With character variables, you would have to set the length first. But it could well be significantly faster than using LAG.

Oligolas · Posted 10-24-2018 02:49 AM

Maybe two thoughts that come to my mind to complement what have been said:

- it looks like you only need to determine the n-1 lag of the current loan_Count if-clause
- You may want to determine if you can make use of the SASFILE Statement, maybe by exploring if you can split up the huge dataset upon performing the operations by time period and account number.

________________________
- Cheers -

Kurt_Bremser · Posted 10-24-2018 03:42 AM

@Astounding wrote:

One more thought ... it could be that the LAG function is slowing the program. Note that these two programs generate identical resrults:

data want1;

set have;

pay_lag1 = lag(payment);

pay_lag2 = lag2(payment);

run;

data want2;

pay_lag2 = pay_lag1;

pay_lag1 = payment;

set have;

retain pay_lag1;

run;

It's less work to do this with numeric variables. With character variables, you would have to set the length first. But it could well be significantly faster than using LAG.

I just ran a similar test against one of our larger datasets (100+ million obs, ~14 G compressed size):

data ww0.test2 (keep=x1 x2);
set source;
length x1 x2 4;
x1 = invar;
x2 = lag(invar);
run;

data ww0.test3 (keep=x1 x2);
set source;
length x1 x2 4;
retain x2;
x1 = invar;
output;
x2 = invar;
run;

In terms of CPU time, both methods performed similar, with only marginal differences. Outside factors (concurrent load, especially I/O) was much more significant with regards to real time needed.

My personal bottom line: I expect a Base SAS function (and one that's been around as long as lag()) to be optimized as much as can be, and not to be outdone by manually written code that does essentially the same (fill and move a FIFO chain)

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 10-24-2018 05:15 PM

@Kurt_Bremser

My experience differs, using LAG() takes 40% longer.

data TEST; 
  do PAYMENT = 1 to 1e8;
    output;
   end;
run;

sasfile TEST load;
 
data _null_;        * 7s CPU/Elapse;
  set TEST;
  PAY_LAG1 = lag (PAYMENT);
  PAY_LAG2 = lag2(PAYMENT);
run;

data _null_;        * 5s CPU/Elapse;
  retain PAY_LAG1;
  PAY_LAG2 = PAY_LAG1;
  PAY_LAG1 = PAYMENT;
  set TEST;
run;

sasfile TEST close;

> I expect a Base SAS function (and one that's been around as long as lag()) to be optimized as much as can be

Considering how badly functions perform when used in WHERE clauses compared to IF clauses, my optimism regarding optimisation doesn't reach the levels of yours. 🙂

High-Performance SAS Coding - Third Edition

Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

Re: Lag performance over 400 million sorted records

I	J	K	L
1	.	.	.
2	1	.	.
3	2	1	.
4	3	.	.
5	4	3	1
6	5	.	.
7	6	5	3
8	7	.	.
9	8	7	5
10	9	.	.
11	10	9	7
12	11	.	.
13	12	11	9
14	13	.	.
15	14	13	11

I	J	K	L
1	.	.	.
2	1	.	.
3	2	1	.
4	3	.	.
5	4	3	1
6	5	.	.
7	6	5	3
8	7	.	.
9	8	7	5
10	9	.	.
11	10	9	7
12	11	.	.
13	12	11	9
14	13	.	.
15	14	13	11

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!

I	J	K	L
1	.	.	.
2	1	.	.
3	2	1	.
4	3	.	.
5	4	3	1
6	5	.	.
7	6	5	3
8	7	.	.
9	8	7	5
10	9	.	.
11	10	9	7
12	11	.	.
13	12	11	9
14	13	.	.
15	14	13	11