Hello,
I would like to apply the lag function to the debit variable, the exact same number of times as the count variable (except for when there is only one accountnum, then there would be no lag). For some accounts the count variable will go into the thousands.
data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;
thanks
Sally
While I strongly suggest that you study @mkeintz's suggested solution in-depth (to see how some quite arcane SAS functions can be used), I probably would not use it in practice/production.
Why?
So it all ends up as a clear case for Maxims 4 & 30. Only if a significant performance gain is observed (say, reducing runtime from 5 to 3 hours), should a more complex method be implemented.
Even if you shave off 90%, but only gain 9 seconds because the original run took just 10, I'd stay with the simple algorithm. Unless you know that the job has to scale up to much larger data in the future.
Please clarify your request by showing what the desired result should look like.
@Selli5 if i understand you correctly, arrays should suffice
increase the subscript to 1000s if you want
data test;
input accountnum trandate :DATE9. debit count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;
data want;
do until(last.accountnum);
set test;
by accountnum;
array lag_debit(100);
if first.accountnum then call missing(of lag_debit(*));
lag_debit(count)=debit;
output;
end;
run;
%MACRO LAGLOOP ;
data temp3 ;
set temp2 ;
by accountnum trandate;
%DO J = 1 %TO 3000 ;
/* %DO J = 1 %TO J le count;*/
lag_&J = lag&J(debit) ;
if first.accountnum then do;
lag_&J = .;
end;
%END;
run ;
%MEND ;
%LAGLOOP ;
This is the closest I have got to the answer, however the lag_1 etc is not unique to accountnum, its just looking at the previous debit regardless of accountnum:
Obs count count2 accountnum trancode trandate Debit lag_1 lag_2 lag_3 lag_4 lag_5 etc
1 | 1 | 100001 | 26 | 30APR2018 | $-2,000.00 | . | . | . | . | . |
3 | 1 | 100004 | 26 | 02MAY2018 | $-1,000.00 | . | . | . | . | . |
1 | 1 | 100005 | 2 | 08MAY2018 | $-1,200.00 | . | . | . | . | . |
2 | 2 | 100005 | 2 | 22MAY2018 | $-1,200.00 | -1200.00 | -1000.00 | -2000.00 | . | . |
30 | 1 | 100006 | 780 | 07MAY2018 | $-51.25 | . | . | . | . | . |
31 | 2 | 100006 | 780 | 08MAY2018 | $-112.25 | -51.25 | -1200.00 | -1200.00 | -1000.00 | -2000.00 |
32 | 3 | 100006 | 780 | 11MAY2018 | $-22.50 | -112.25 | -51.25 | -1200.00 | -1200.00 | -1000.00 |
33 | 4 | 100006 | 780 | 15MAY2018 | $-26.34 | -22.50 | -112.25 | -51.25 | -1200.00 | -1200.00 |
34 | 5 | 100006 | 780 | 17MAY2018 | $-100.65 | -26.34 | -22.50 | -112.25 | -51.25 | -1200.00 |
35 | 6 | 100006 | 780 | 23MAY2018 | $-69.30 | -100.65 | -26.34 | -22.50 | -112.25 | -51.25 |
You don't need a macro at all.
data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;
proc sql noprint;
select max(count) - 1 into :maxcount trimmed from test;
quit;
data want;
set test;
array debits {&maxcount} debit_1-debit_&maxcount;
retain debits;
by accountnum;
prevdebit = lag(debit);
if first.accountnum
then do;
do index = 1 to &maxcount;
debits{index} = .;
end;
end;
else do;
do index = &maxcount to 2 by -1;
debits{index} = debits{index-1};
end;
debits{1} = prevdebit;
end;
drop prevdebit index;
run;
The only "macro" thing here is the macrovar maxcount, to dynamically determine the number of columns needed.
Hi, Below is the first 10 obs. When I run the code it lag_1 is referring to the current debit instead of the previous debit and the lag function is also excluded from the last account (10006) altogether: How do you fix this?
thanks
Sally
count accountnum trandate Debit lag_1 lag_2 lag_3 lag_4 lag_5
1 | 100001 | 30APR2018 | $-2,000.00 | -2000 | . | . | . | . |
1 | 100004 | 02MAY2018 | $-1,000.00 | . | . | -1000 | . | . |
1 | 100005 | 08MAY2018 | $-1,200.00 | -1200 | . | . | . | . |
2 | 100005 | 22MAY2018 | $-1,200.00 | -1200 | -1200 | . | . | . |
1 | 100006 | 07MAY2018 | $-51.25 | . | . | . | . | . |
2 | 100006 | 08MAY2018 | $-112.25 | . | . | . | . | . |
3 | 100006 | 11MAY2018 | $-22.50 | . | . | . | . | . |
4 | 100006 | 15MAY2018 | $-26.34 | . | . | . | . | . |
5 | 100006 | 17MAY2018 | $-100.65 | . | . | . | . | . |
6 | 100006 | 23MAY2018 | $-69.30 |
Please post a sample of your expected output for the input sample you provided. Thank you
Thanks, but I get this error message:
ERROR: Array subscript out of range at line 69 column 1.
Thanks its working now.
Is there anyway to instead of writing an exact number...eg (array lag_debit(3000);,)
it can do the exact number of lags = the last.accountnum count variable?
eg if there are 6 occurances of an account and the last accountnum has count=6 then do 6 lags and if an accountnum has count=3000 on the last.accountnum then do 3000 lags?
Sally
@Selli5 wrote:
Thanks its working now.
Is there anyway to instead of writing an exact number...eg (array lag_debit(3000);,)
it can do the exact number of lags = the last.accountnum count variable?
eg if there are 6 occurances of an account and the last accountnum has count=6 then do 6 lags and if an accountnum has count=3000 on the last.accountnum then do 3000 lags?
Sally
See this slight adaptation of my code:
data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;
proc sql noprint;
select max(count) - 1 into :maxcount trimmed from test;
quit;
data want;
set test;
array debits {&maxcount} debit_1-debit_&maxcount;
retain debits;
by accountnum;
prevdebit = lag(debit);
if first.accountnum
then do;
do index = 1 to &maxcount;
debits{index} = .;
end;
end;
else do;
do index = count - 1 to 2 by -1;
debits{index} = debits{index-1};
end;
debits{1} = prevdebit;
end;
drop prevdebit index;
run;
proc print data=want noobs;
run;
Result:
accountnum trandate debit count debit_1 debit_2 debit_3 debit_4 100001 21304 2000.0 1 . . . . 100002 21327 3000.0 1 . . . . 100002 21330 2500.5 2 3000.0 . . . 100002 21331 2100.0 3 2500.5 3000.0 . . 100002 21332 2400.0 4 2100.0 2500.5 3000.0 . 100002 21333 2500.1 5 2400.0 2100.0 2500.5 3000 100003 21327 2450.1 1 . . . . 100003 21328 2500.5 2 2450.1 . . .
Clearly the resolution of this problem is much more amenable to arrays than use of multiple lag functions.
Although @Kurt_Bremser's solution is straightforward, it is burdened with the time needed to execute loops of increasing size as the count variable grows. This is an excellent problem to consider a more sophisticated way (#2 below) of handling the procession of long arrays of values.
I suggest taking advantage of arrays in these additional ways:
data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;
proc sql noprint;
select max(count) - 1 into :maxcount trimmed from test;
quit;
%put &=maxcount;
data want;
set test;
by accountnum;
array debits {*} prevdebit debit_1-debit_&maxcount;
retain debits;
if _n_=1 then do;
addrprev=addrlong(prevdebit);
addr1=addrlong(debit_1);
end;
retain addr: ;
prevdebit = lag(debit);
if first.accountnum then call missing (of debits{*});
else call pokelong(peekclong(addrprev,8*(count-1)),addr1);
drop prevdebit addr: ;
run;
proc print data=want noobs;
run;
Notes:
I'd be very interested in the speed comparisons of this program versus the do-loop approach.
Edited additional notes:
It's possible the "call missing(of debits{*})" is also a bit inefficient. It will always reset to missing the entire array even if only 10% of it has non-missing values. That could be solved by (1) adding a temporary array of missing values, (2) getting lag of count, and (3) using call pokelong instead of call missing, as in:
data want;
set test;
by accountnum;
array debits {*} prevdebit debit_1-debit_&maxcount;
array misvals {0:&maxcount} _temporary_; /* New statement */
retain debits;
if _n_=1 then do;
addrprev=addrlong(prevdebit);
addr1=addrlong(debit_1);
addrmis=addrlong(misvals{0}); /* New statement */
end;
retain addr: ;
prevdebit = lag(debit);
prevcount = lag(count);
if first.accountnum then call pokelong(peekclong(addrmis,8*prevcount),addrprev); /* Modified statement */
else call pokelong(peekclong(addrprev,8*(count-1)),addr1);
drop prev: addr: ;
run;
proc print data=want noobs;
run;
While I strongly suggest that you study @mkeintz's suggested solution in-depth (to see how some quite arcane SAS functions can be used), I probably would not use it in practice/production.
Why?
So it all ends up as a clear case for Maxims 4 & 30. Only if a significant performance gain is observed (say, reducing runtime from 5 to 3 hours), should a more complex method be implemented.
Even if you shave off 90%, but only gain 9 seconds because the original run took just 10, I'd stay with the simple algorithm. Unless you know that the job has to scale up to much larger data in the future.
If you want the lag but not for the first record you can set it to missing using FIRST.
Lag is kind of tricky to understand, so I highly recommend reading this paper to understand how it works.
http://support.sas.com/resources/papers/proceedings09/055-2009.pdf
Making some massive assumptions, I think this is what you're looking for. Note the BY statement which may mean you need to sort the data.
data want;
set test;
by accountNum;
lag_debit=lag(debit);
if first.accountNum then lag_debit = . ;
run;
@Selli5 wrote:
Hello,
I would like to apply the lag function to the debit variable, the exact same number of times as the count variable (except for when there is only one accountnum, then there would be no lag). For some accounts the count variable will go into the thousands.
data test;
input accountnum trandate DATE9. debit 8.2 count ;
cards;
100001 30APR2018 2000.00 1
100002 23MAY2018 3000.00 1
100002 26MAY2018 2500.50 2
100002 27MAY2018 2100.00 3
100002 28MAY2018 2400.00 4
100002 29MAY2018 2500.10 5
100003 23MAY2018 2450.10 1
100003 24MAY2018 2500.50 2
;
run;
thanks
Sally
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.