BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Selli5
Fluorite | Level 6

Hello,

 

I would like to apply the lag function to the debit variable, the exact same number of times as the count variable (except for when there is only one accountnum, then there would be no lag).  For some accounts the count variable will go into the thousands.


data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;

 

thanks

Sally

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

While I strongly suggest that you study @mkeintz's suggested solution in-depth (to see how some quite arcane SAS functions can be used), I probably would not use it in practice/production.

Why?

  • most performance can be gained by reducing I/O; both methods require the same number of reads through the dataset
  • using the peek/poke functions this way is equivalent to C pointer math, which a prominent hacker I admire quite a lot has recently called a "defect attractor".
  • Such code meets my definition of "clever" (see Maxim 29), and can cause unnecessary confusion when it needs to be maintained by a novice. If you constantly write code that only you can understand, you'll never be able to delegate work.

So it all ends up as a clear case for Maxims 4 & 30. Only if a significant performance gain is observed (say, reducing runtime from 5 to 3 hours), should a more complex method be implemented.

Even if you shave off 90%, but only gain 9 seconds because the original run took just 10, I'd stay with the simple algorithm. Unless you know that the job has to scale up to much larger data in the future.

View solution in original post

12 REPLIES 12
mkeintz
PROC Star

Please clarify your request by showing what the desired result should look like.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
novinosrin
Tourmaline | Level 20

@Selli5 if i understand you correctly, arrays should suffice

increase the subscript to 1000s if you want

 

data test;
input accountnum trandate :DATE9. debit  count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;

data want;
do until(last.accountnum);
set test;
by 	accountnum;
array lag_debit(100);
if first.accountnum then call missing(of lag_debit(*));
lag_debit(count)=debit;
output;
end;
run;


Selli5
Fluorite | Level 6

%MACRO LAGLOOP ;
  data temp3 ;
    set temp2 ;
    by accountnum trandate;
  %DO J = 1 %TO 3000 ;
 /* %DO J = 1 %TO J le count;*/
 
      lag_&J = lag&J(debit) ;    
   if first.accountnum then do;
      lag_&J = .;
      end;
%END;
  run ;
%MEND ;
%LAGLOOP ;

 

This is the closest I have got to the answer, however the lag_1 etc is not unique to accountnum, its just looking at the previous debit regardless of accountnum:

Obs count count2 accountnum trancode trandate Debit lag_1 lag_2 lag_3 lag_4 lag_5 etc

111000012630APR2018$-2,000.00.....
311000042602MAY2018$-1,000.00.....
11100005208MAY2018$-1,200.00.....
22100005222MAY2018$-1,200.00-1200.00-1000.00-2000.00..
30110000678007MAY2018$-51.25.....
31210000678008MAY2018$-112.25-51.25-1200.00-1200.00-1000.00-2000.00
32310000678011MAY2018$-22.50-112.25-51.25-1200.00-1200.00-1000.00
33410000678015MAY2018$-26.34-22.50-112.25-51.25-1200.00-1200.00
34510000678017MAY2018$-100.65-26.34-22.50-112.25-51.25-1200.00
35610000678023MAY2018$-69.30-100.65-26.34-22.50-112.25-51.25

 

 

Kurt_Bremser
Super User

You don't need a macro at all.

data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;

proc sql noprint;
select max(count) - 1 into :maxcount trimmed from test;
quit;

data want;
set test;
array debits {&maxcount} debit_1-debit_&maxcount;
retain debits;
by accountnum;
prevdebit = lag(debit);
if first.accountnum
then do;
  do index = 1 to &maxcount;
    debits{index} = .;
  end;
end;
else do;
  do index = &maxcount to 2 by -1;
    debits{index} = debits{index-1};
  end;
  debits{1} = prevdebit;
end;
drop prevdebit index;
run;

The only "macro" thing here is the macrovar maxcount, to dynamically determine the number of columns needed.

Selli5
Fluorite | Level 6

Hi,  Below is the first 10 obs.  When I run the code it lag_1 is referring to the current debit instead of the previous debit and the lag function is also excluded from the last account (10006) altogether: How do you fix this?

 

thanks

Sally

 

count       accountnum trandate    Debit            lag_1     lag_2     lag_3     lag_4    lag_5

110000130APR2018$-2,000.00-2000....
110000402MAY2018$-1,000.00..-1000..
110000508MAY2018$-1,200.00-1200....
210000522MAY2018$-1,200.00-1200-1200...
110000607MAY2018$-51.25.....
210000608MAY2018$-112.25.....
310000611MAY2018$-22.50.....
410000615MAY2018$-26.34.....
510000617MAY2018$-100.65.....
610000623MAY2018$-69.30

 

 

novinosrin
Tourmaline | Level 20

Please post a sample of your expected output for the input sample you provided. Thank you

Selli5
Fluorite | Level 6

Thanks, but I get this error message:

 ERROR: Array subscript out of range at line 69 column 1.

 

Selli5
Fluorite | Level 6

Thanks its working now. 

 

Is there anyway to instead of writing an exact number...eg (array lag_debit(3000);,)

 

it can do the exact number of lags = the last.accountnum count variable?

 

eg if there are 6 occurances of an account and the last accountnum has count=6 then do 6 lags and if an accountnum has count=3000 on the last.accountnum then do 3000 lags?

 

Sally

Kurt_Bremser
Super User

@Selli5 wrote:

Thanks its working now. 

 

Is there anyway to instead of writing an exact number...eg (array lag_debit(3000);,)

 

it can do the exact number of lags = the last.accountnum count variable?

 

eg if there are 6 occurances of an account and the last accountnum has count=6 then do 6 lags and if an accountnum has count=3000 on the last.accountnum then do 3000 lags?

 

Sally


See this slight adaptation of my code:

data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;

proc sql noprint;
select max(count) - 1 into :maxcount trimmed from test;
quit;

data want;
set test;
array debits {&maxcount} debit_1-debit_&maxcount;
retain debits;
by accountnum;
prevdebit = lag(debit);
if first.accountnum
then do;
  do index = 1 to &maxcount;
    debits{index} = .;
  end;
end;
else do;
  do index = count - 1 to 2 by -1;
    debits{index} = debits{index-1};
  end;
  debits{1} = prevdebit;
end;
drop prevdebit index;
run;

proc print data=want noobs;
run;

Result:

accountnum    trandate     debit    count    debit_1    debit_2    debit_3    debit_4

  100001        21304     2000.0      1           .          .          .          . 
  100002        21327     3000.0      1           .          .          .          . 
  100002        21330     2500.5      2       3000.0         .          .          . 
  100002        21331     2100.0      3       2500.5     3000.0         .          . 
  100002        21332     2400.0      4       2100.0     2500.5     3000.0         . 
  100002        21333     2500.1      5       2400.0     2100.0     2500.5      3000 
  100003        21327     2450.1      1           .          .          .          . 
  100003        21328     2500.5      2       2450.1         .          .          . 
mkeintz
PROC Star

Clearly the resolution of this problem is much more amenable to arrays than use of multiple lag functions.

 

Although @Kurt_Bremser's solution is straightforward, it is burdened with the time needed to execute loops of increasing size as the count variable grows.  This is an excellent problem to consider a more sophisticated way (#2 below) of handling the procession of long arrays of values.

 

I suggest taking advantage of arrays in these additional ways:

  1. Use the call missing(of debits{*}) to reset the entire array to missing while avoiding the construction and execution of a loop.  That's in the "if first.accountnum" statement.
  2. In the "else" statement, take advantage of the fact that array elements are stored in contiguous locations in memory, with each numeric element occupying 8 bytes.  So if you make provision for an array of 3000 elements, it is stored in 24,000 contiguous bytes in memory.   As a result, if you need to shift 1000 elements (when variable COUNT=1001) to the right by one position in the array, you could use one instruction to move 8000 bytes (1000 elements * 8 bytes/element) to the right by 8 bytes, instead of 1000 iterations of copying a single element one position to the right. The bigger the COUNT variable, the more efficient this approach will be.  It uses the ADDRLONG, PEEKCLONG, and CALL POKE functions, in the program below.  I've added a few notes about them afterwards.

 

data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;

proc sql noprint;
select max(count) - 1 into :maxcount trimmed from test;
quit;
%put &=maxcount;


data want;
  set test;
  by accountnum;

  array debits {*} prevdebit debit_1-debit_&maxcount;
  retain debits;

  if _n_=1 then do;
    addrprev=addrlong(prevdebit);
addr1=addrlong(debit_1); end; retain addr: ; prevdebit = lag(debit); if first.accountnum then call missing (of debits{*}); else call pokelong(peekclong(addrprev,8*(count-1)),addr1); drop prevdebit addr: ; run; proc print data=want noobs; run;

 

 

Notes:

  1. I include PREVDEBIT at the beginning of the array DEBITS, so it is included in the array procession operations.
  2. The ADDRLONG function gets the memory address of PREVDEBIT (the left-end of the source of the transfer) and DEBIT_1 (left-end of the destination).  I put the address generation in an "IF _N_=1" do group, because the addresses are fixed, and need to be calculated only once, instead of for each incoming observation.  By retaining the results (addrprev and addr1) they are available for use in the call poke function for every record.
  3. The PEEKCLONG(addrprev,8*(count-1))  function retrieves a contiguous string of 8*(count-1) bytes starting at addrprev, which is the leftmost location of the array.
  4. The peekclong is embedded in a CALL POKE routine which takes the content of the first argument (i.e. the bytes retrieved in #3 above) and stores them at the address specified in the 2nd argument, i.e. addr1.

 

I'd be very interested in the speed comparisons of this program versus the do-loop approach.

 

Edited additional notes:

It's possible the "call missing(of debits{*})" is also a bit inefficient.  It will always reset to missing the entire array even if only 10% of it has non-missing values.  That could be solved by (1) adding a temporary array of missing values, (2) getting lag of count, and (3) using call pokelong instead of call missing, as in:

 

data want;
  set test;
  by accountnum;

  array debits {*} prevdebit debit_1-debit_&maxcount;
  array misvals {0:&maxcount} _temporary_;   /* New statement */
  retain debits;

  if _n_=1 then do;
    addrprev=addrlong(prevdebit);
    addr1=addrlong(debit_1);
    addrmis=addrlong(misvals{0});     /* New statement */
  end;
  retain addr: ;
  prevdebit = lag(debit);
  prevcount = lag(count);

  if first.accountnum then call pokelong(peekclong(addrmis,8*prevcount),addrprev);    /* Modified statement */
  else call pokelong(peekclong(addrprev,8*(count-1)),addr1);

drop prev: addr: ;
run;

proc print data=want noobs;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Kurt_Bremser
Super User

While I strongly suggest that you study @mkeintz's suggested solution in-depth (to see how some quite arcane SAS functions can be used), I probably would not use it in practice/production.

Why?

  • most performance can be gained by reducing I/O; both methods require the same number of reads through the dataset
  • using the peek/poke functions this way is equivalent to C pointer math, which a prominent hacker I admire quite a lot has recently called a "defect attractor".
  • Such code meets my definition of "clever" (see Maxim 29), and can cause unnecessary confusion when it needs to be maintained by a novice. If you constantly write code that only you can understand, you'll never be able to delegate work.

So it all ends up as a clear case for Maxims 4 & 30. Only if a significant performance gain is observed (say, reducing runtime from 5 to 3 hours), should a more complex method be implemented.

Even if you shave off 90%, but only gain 9 seconds because the original run took just 10, I'd stay with the simple algorithm. Unless you know that the job has to scale up to much larger data in the future.

Reeza
Super User

If you want the lag but not for the first record you can set it to missing using FIRST. 


Lag is kind of tricky to understand, so I highly recommend reading this paper to understand how it works. 

http://support.sas.com/resources/papers/proceedings09/055-2009.pdf

 

Making some massive assumptions, I think this is what you're looking for. Note the BY statement which may mean you need to sort the data. 

 

data want;
set test;

by accountNum;

lag_debit=lag(debit);

if first.accountNum then lag_debit = . ;

run;

@Selli5 wrote:

Hello,

 

I would like to apply the lag function to the debit variable, the exact same number of times as the count variable (except for when there is only one accountnum, then there would be no lag).  For some accounts the count variable will go into the thousands.


data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;

 

thanks

Sally


 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1653 views
  • 2 likes
  • 5 in conversation