BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bentleyj1
Quartz | Level 8

I'm reworking a program that has a data set with about 400 million records sorted by account number with one account number per time period.  There are about 300 time periods but each account can be in the data set only maybe 20 times.


We have to process by account_number because we need to derive running totals for each account number across all time periods.  
Some derived variables requires lagging.  Here's the existing code-.  For each record a series of lagged variables are created and then referenced in the code later.

 

 dpd_lag1=lag(dpd);  pay_lag1=lag(payment_amount);  prin_lag1=lag(payment);
 dpd_lag2=lag2(dpd);  pay_lag2=lag2(payment_amount);  prin_lag2=lag2(payment);
 dpd_lag3=lag3(dpd);  pay_lag3=lag3(payment_amount);  prin_lag3=lag3(payment);
 dpd_lag4=lag4(dpd);  pay_lag4=lag4(payment_amount);  prin_lag4=lag4(payment);
 dpd_lag5=lag5(dpd);  pay_lag5=lag5(payment_amount);  prin_lag5=lag5(payment);

 

 if loan_count=2 then do;
        max_dpd=max(dpd,dpd_lag1);
        max_dpd_2=dpd_lag1;

        if dq_monthly_payment > 0 then do;
          chrva421=(100*payment)/dq_monthly_payment;
          chrva421p=(100*sum(payment_amount,pay_lag1))/(dq_monthly_payment*2);
       end;
       else do;
          chrva421=.;
          chrva421p=.;
      end;
    end;

  else if loan_count=3 then do;
      max_dpd=max(dpd,dpd_lag1,dpd_lag2);
      max_dpd_2=max(dpd_lag1,dpd_lag2);

      if dq_monthly_payment > 0 then do;
         chrva421=(100*sum(payment,prin_lag1))/(dq_monthly_payment*2);
         chrva421p=(100*sum(payment_amount,pay_lag1,pay_lag2))/(dq_monthly_payment*3);
      end;
    else do;
      chrva421=.;
      chrva421p=.;
    end;
  end;

  ** continues up to loan_count=6 ;

 

The program takes Hours to run on our server. 

 

My question is, do you think it would be faster to lag the variables 'on the fly' instead of creating them as we're doing here?  We'd remove the lag assignment statements and the code would look like this with LAG functions used on-the-fly. (may be parenthesis issues in here.)

 

 if loan_count=2 then do;
        max_dpd=max(dpd,lag(dpd));
        max_dpd_2=lag(dpd);

       if dq_monthly_payment > 0 then do;
         chrva421=(100*payment)/dq_monthly_payment;
         chrva421p=(100*sum(payment_amount,lag(payment_amount))/(dq_monthly_payment*2);
       end;
       else do;
        chrva421=.;
        chrva421p=.;
      end;
    end;

  else if loan_count=3 then do;
    max_dpd=max(dpd,lag(dpd),lag2(dpd));
   max_dpd_2=max(dpd_lag1,dpd_lag2);

    if dq_monthly_payment > 0 then do;
     chrva421=(100*sum(payment,lag(payment))/(dq_monthly_payment*2);
     chrva421p=(100*sum(payment_amount,lag(payment),lag2(payment))/(dq_monthly_payment*3);
    end;
    else do;
      chrva421=.;
      chrva421p=.;
    end;
  end;
 
I'm also thinking of putting this into a macro do-loop.
 
Thanks in advance for your thoughts.
 

John

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
To me, you'd need to reset it at the beginning of each new ID wouldn't you?If you're only going up to 5 though and that makes sense you can just overwrite the numbers can't you. You can easily set it to missing with CALL MISSING().

View solution in original post

21 REPLIES 21
Reeza
Super User
If your data is sorted, what about a temporary array instead? And if you know it's only twenty maximum per ID that's the limit of the size you'd need and SAS will drop them automatically.
bentleyj1
Quartz | Level 8

I've thought of using an array but figured i'd ask for opinions on this solution first.  The issue with a _temporary_ array is that the values are retained-- "Temporary data element values are always automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step."  http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000201956.htm

 

I need the array to be refreshed/reloaded with each iteration of the DATA step because an account can have more than 5 records.  So would anything be gained from an array versus assignment statements?

novinosrin
Tourmaline | Level 20

To counter -The issue with a _temporary_ array is that the values are retained-- "Temporary data element values are always automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step."  

 

 array temp(9999) _temporary_;

call missing(of temp(*));

at the top of each iteration 

 

 

bentleyj1
Quartz | Level 8

Brilliant.  Thanks.  I might go with the temporary array solution... it's got to be faster than adding 15 variables to each record and then dropping them at the end of the step.

novinosrin
Tourmaline | Level 20

Yes Sir. i concur with @Reeza 's idea too as no output buffer is created and the variables are contiguous and so easy to peek/compute. Let us know if you need any further help. 

 

Reeza
Super User
To me, you'd need to reset it at the beginning of each new ID wouldn't you?If you're only going up to 5 though and that makes sense you can just overwrite the numbers can't you. You can easily set it to missing with CALL MISSING().
bentleyj1
Quartz | Level 8

Thanks to Everyone who replied.  As usual, I'm immensely grateful for the intelligent responses I received.

 

My largest concern was that for each record we were creating 15 lagged variables whether or not they were needed for the calculations--the number of needed lagged variables is based on the loan_count--if loan_count=2 then we need 2 lagged variables and only if it is more than 6 do we need all 15.  As a short-term fix until I can further investigate the temporary array approach I'm using this code that lags only the fields needed for the calculation.


 if loan_count = 2 then do;

   ** here create only the lagged variables we need. ;
   dpd_lag1 =  lag(dpd);  pay_lag1 =  lag(payment_amount); 

      max_dpd = max(dpd,dpd_lag1);
      max_dpd_2 = dpd_lag1;
      if dq_monthly_payment > 0 then do;
      CHRVA421 = (100 * payment) / dq_monthly_payment;
      CHRVA421P = (100 * sum(payment_amount,pay_lag1)) / (dq_monthly_payment * 2);
      end;
    end;

 

else if loan_count = 4 then do;
  dp_lag1 =  lag(dpd);     pay_lag1 =  lag(payment_amount);  prin_lag1 =  lag(payment);
  dpd_lag2 = lag2(dpd);  pay_lag2 = lag2(payment_amount);  prin_lag2 = lag2(payment);
  dpd_lag3 = lag3(dpd);  pay_lag3 = lag3(payment_amount);  

      max_dpd = max(dpd,dpd_lag1,dpd_lag2,dpd_lag3);
      max_dpd_2 = max(dpd_lag1,dpd_lag2,dpd_lag3);
      if dq_monthly_payment > 0 then do;
      CHRVA421 = (100 * sum(payment,prin_lag1,prin_lag2)) / (dq_monthly_payment * 3);
      CHRVA421P = (100 * sum(payment_amount,pay_lag1,pay_lag2,pay_lag3))/(dq_monthly_payment * 4);
      end;
 end;

 

else if loan_count > 6 then do;
  dpd_lag1 =  lag(dpd);   pay_lag1 =  lag(payment_amount);  prin_lag1 =  lag(payment);
  dpd_lag2 = lag2(dpd);  pay_lag2 = lag2(payment_amount);  prin_lag2 = lag2(payment);
  dpd_lag3 = lag3(dpd);  pay_lag3 = lag3(payment_amount);  prin_lag3 = lag3(payment);
  dpd_lag4 = lag4(dpd);  pay_lag4 = lag4(payment_amount);  prin_lag4 = lag4(payment);
  dpd_lag5 = lag5(dpd);  pay_lag5 = lag5(payment_amount);  prin_lag5 = lag5(payment);

   max_dpd = max(dpd,dpd_lag1,dpd_lag2,dpd_lag3,dpd_lag4,dpd_lag5);
   max_dpd_2 = max(dpd_lag1,dpd_lag2,dpd_lag3,dpd_lag4,dpd_lag5);
   if dq_monthly_payment > 0 then do;
     CHRVA421 = (100 * sum(payment,prin_lag1,prin_lag2,prin_lag3,prin_lag4,prin_lag5))/(dq_monthly_payment*6);
     CHRVA421P = (100 * sum(payment_amount,pay_lag1,pay_lag2,pay_lag3,pay_lag4,pay_lag5))/(dq_monthly_payment*6);
      end;

 end;

 

ChrisNZ
Tourmaline | Level 20

If your data was sorted by descending loan_count you slowly decrease the number of lags to be performed as you no longer need as many. Note that saving values before the SET statement was determined to be more efficient than saving them via the LAG() function.

 

Your code will not work.

The lag5 stack will only be updated when loan_count > 6, so will fall out of sync.

data HAVE;
 do I=1 to 15;
   output; 
  end; 
run;
data WANT; 
  set HAVE;
  J=lag(I);
  if mod(I,2) then K=lag(I);
  if mod(I,2) then L=lag2(I);
run;

 

I J K L
1 . . .
2 1 . .
3 2 1 .
4 3 . .
5 4 3 1
6 5 . .
7 6 5 3
8 7 . .
9 8 7 5
10 9 . .
11 10 9 7
12 11 . .
13 12 11 9
14 13 . .
15 14 13 11
data_null__
Jade | Level 19

How did you determine the LAG functions are causing the performance issue?

 

If you are executing the LAG or any function unnecessarily then not doing should improve performance.

 

Will the LAGs be right with when the are executed as you suggest?  i.e. the conditional LAG problem.

 

 

ChrisNZ
Tourmaline | Level 20

 

1. Arrays are a good option to explore

 

2. Your suggestion

if loan_count=2 then do;
        max_dpd=max(dpd,lag(dpd));

would not work. The LAG stack must be updated with each observation. The IF test prevents that.

 

3. I don't think you need 

  else do;
        chrva421=.;
        chrva421p=.;
      end;

 

4. Temporary arrays are much faster than named ones.

Astounding
PROC Star

One more thought ... it could be that the LAG function is slowing the program.  Note that these two programs generate identical resrults:

 

data want1;

set have;

pay_lag1 = lag(payment);

pay_lag2 = lag2(payment);

run;

 

data want2;

pay_lag2 = pay_lag1;

pay_lag1 = payment;

set have;

retain pay_lag1;

run;

 

It's less work to do this with numeric variables.  With character variables, you would have to set the length first.  But it could well be significantly faster than using LAG.

Oligolas
Barite | Level 11

Maybe two thoughts that come to my mind to complement what have been said:

- it looks like you only need to determine the n-1 lag of the current loan_Count if-clause
-
You may want to determine if you can make use of the SASFILE Statement, maybe by exploring if you can split up the huge dataset upon performing the operations by time period and account number.

 

________________________

- Cheers -

Kurt_Bremser
Super User

@Astounding wrote:

One more thought ... it could be that the LAG function is slowing the program.  Note that these two programs generate identical resrults:

 

data want1;

set have;

pay_lag1 = lag(payment);

pay_lag2 = lag2(payment);

run;

 

data want2;

pay_lag2 = pay_lag1;

pay_lag1 = payment;

set have;

retain pay_lag1;

run;

 

It's less work to do this with numeric variables.  With character variables, you would have to set the length first.  But it could well be significantly faster than using LAG.


I just ran a similar test against one of our larger datasets (100+ million obs, ~14 G compressed size):

data ww0.test2 (keep=x1 x2);
set source;
length x1 x2 4;
x1 = invar;
x2 = lag(invar);
run;

data ww0.test3 (keep=x1 x2);
set source;
length x1 x2 4;
retain x2;
x1 = invar;
output;
x2 = invar;
run;

In terms of CPU time, both methods performed similar, with only marginal differences. Outside factors (concurrent load, especially I/O) was much more significant with regards to real time needed.

 

My personal bottom line: I expect a Base SAS function (and one that's been around as long as lag()) to be optimized as much as can be, and not to be outdone by manually written code that does essentially the same (fill and move a FIFO chain)

ChrisNZ
Tourmaline | Level 20

@Kurt_Bremser

My experience differs, using LAG() takes 40% longer.

 

data TEST; 
  do PAYMENT = 1 to 1e8;
    output;
   end;
run;

sasfile TEST load;
 
data _null_;        * 7s CPU/Elapse;
  set TEST;
  PAY_LAG1 = lag (PAYMENT);
  PAY_LAG2 = lag2(PAYMENT);
run;

data _null_;        * 5s CPU/Elapse;
  retain PAY_LAG1;
  PAY_LAG2 = PAY_LAG1;
  PAY_LAG1 = PAYMENT;
  set TEST;
run;

sasfile TEST close;

 

> I expect a Base SAS function (and one that's been around as long as lag()) to be optimized as much as can be

Considering how badly functions perform when used in WHERE clauses compared to IF clauses, my optimism regarding optimisation doesn't reach the levels of yours. 🙂

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 2022 views
  • 11 likes
  • 9 in conversation