@Tom:
Agreed, in more than one sense (see below). I answered the OP's question about clearing the lags since that is what the OP had asked for; and I don't think that for problems of this kind using the LAGn function, let alone using multiple queues, is called for.
As to your code, note that though it doesn't use the LAGn function explicitly, it essentially emulates the working of LAG6 by moving all the elements of an artificial queue LAG1-LAG6 up by one for every record - which is what LAG6 does behind the scenes (only about an order of magnitude more efficiently - I've tested). The advantage of using your artificial queue is that all of its items are accessible in the PDV, while with LAG6 the only accessible item is the one at the head of the queue. (That must be why the OP deemed the use of all the functions LAG1 through LAG6 necessary.)
Having said that, your code involves a lot of hard coding - the more, the wider the rolling sum window. Hence, the same concept can be expressed more tersely using an array, e.g. (I've set W=3 to allow at least one BY group to be larger than W - otherwise we don't have a good test case):
data have ;
  input SN $ Name $ Date Count ;
  informat date anydtdte. ;
  format date yymm7. ;
  cards ;
11075652 NameA 03/2019 12
11075652 NameA 04/2019 4
11075652 NameA 05/2019 3
11075652 NameA 06/2019 1
11075652 NameB 05/2019 1
11075682 NameA 07/2019 1
11075682 NameC 05/2018 2
11075682 NameC 06/2018 2
11075682 NameC 07/2018 2
11075682 NameC 08/2018 0
11075682 NameC 09/2018 2
;
run ;
%let w = 3 ;
data want ;
  do until (last.name) ;
    set have ;
    by sn name ;
    array cl count lag1-lag&w ;
    cum_sum = sum (of cl[*]) ;
    output ;
    do _i_ = dim (cl) to 2 by -1 ;
      cl [_i_] = cl [_i_-1] ;
    end ;
  end ;
run ;
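For the curious, the queue-shifting logic above is not SAS-specific. Here is a minimal Python sketch of the same idea - keep the current value plus the last W lags in a small list, sum what's there, then shift. The function name, W=3, and the sample counts are my own illustrative assumptions, not part of the original solution:

```python
W = 3

def rolling_sums(counts, w=W):
    """Rolling sum over the current value and up to w lags,
    with the lag queue cleared at the start (one call per BY group)."""
    lags = [None] * w            # lag1..lagW, initially "missing"
    out = []
    for c in counts:
        window = [c] + lags                         # COUNT, LAG1..LAGw
        out.append(sum(v for v in window if v is not None))
        lags = [c] + lags[:-1]   # shift the queue down by one slot
    return out

# NameC group from the sample data: counts 2, 2, 2, 0, 2
print(rolling_sums([2, 2, 2, 0, 2]))  # [2, 4, 6, 6, 6]
```

Note that, like the SAS array version, each record costs O(W) work - which is exactly the inefficiency discussed next.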
Now, as I've repeatedly stated in the past (including in a similar thread, methinks originated by the same OP in the same vein), the whole idea of computing a rolling sum in a window of size W by calculating all the lags from 1 to W and summing them up is totally misguided. The reason is that to compute a rolling sum, we only need to add the leading item and subtract the item W+1 records back in the file. This way, computing a rolling sum for W=3 and for W=100000 is equally efficient, while summing up 3 lag items vs 100000 lag items - not to mention getting them all in the first place - is anything but. In this case, the leading item is the current COUNT, so all we need is the value of COUNT W+1 records back in the file - and that only when the size of the BY group exceeds W+1. That item can be fetched using the LAGw function, e.g.:
data want ;
  do until (last.name) ;
    set have ;
    by sn name ;
    link lag ;
    cum_sum = sum (cum_sum, count, - _sub) ;
    output ;
  end ;
  count = 0 ;
  do _n_ = 1 to &w + 1 ;
    link lag ;
  end ;
  return ;
  lag: _sub = sum (lag%eval(&w+1)(count), 0) ;
  return ;
run ;
However, again, using the LAGw function here, even just a single one, is far from optimal, since its internal queue needs to be cleared before/after each BY group, which for large W means a lot of overhead. It's much simpler and less onerous to fetch the item W+1 records back by using a direct-access SET, e.g.:
data want (drop = _: p q) ;
  do _n_ = 1 by 1 until (last.name) ;
    set have curobs = q ;
    by sn name ;
    _c = . ; * clear _C so a value fetched in a prior BY group cannot leak in ;
    if _n_ > &w + 1 then do ;
      p = q - (&w + 1) ;
      set have (keep=count rename=count=_c) point = p ;
    end ;
    cum_sum = sum (cum_sum, count, - sum (_c, 0)) ;
    output ;
  end ;
run ;
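The add-one/subtract-one arithmetic is the same in any language. A hedged Python sketch of the direct-access idea - index the value W+1 positions back instead of dragging a queue along - with illustrative names and data of my own:

```python
W = 3

def rolling_sums_o1(counts, w=W):
    """O(1) work per record: add the incoming value, subtract the
    value that has just fallen out of the (w+1)-wide window."""
    total, out = 0, []
    for i, c in enumerate(counts):
        total += c
        if i >= w + 1:                    # window is full; evict oldest
            total -= counts[i - (w + 1)]  # direct access, like POINT=
        out.append(total)
    return out

print(rolling_sums_o1([2, 2, 2, 0, 2]))  # [2, 4, 6, 6, 6]
```

The per-record cost here is independent of W, which is the whole point of the argument above.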
In principle, there can be situations where we need all the lag values (though I can hardly fancy what for - surely not for computing a rolling sum, as shown above). In this case, using multiple LAG function queues is utterly misguided, too, because a simple array can do the job much more efficiently:
%let w = 3 ;
proc sql noprint ;
  select max (q) into :q
  from (select count (*) as q from have group by sn, name) ;
quit ;
data want ;
  array q [-&w:&q] _temporary_ ;
  do _n_ = 1 by 1 until (last.name) ;
    set have ;
    by sn name ;
    q[_n_] = count ;
    array ll lag1-lag&w ;
    do over ll ;
      ll = q[_n_-_i_] ;
    end ;
    output ;
  end ;
run ;
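The same store-everything-and-index idea can be sketched in Python, one call per BY group (names and data are illustrative assumptions): positions before the start of the group simply yield a missing value, just as the never-written negative array slots do above.

```python
W = 3

def lags_from_list(counts, w=W):
    """For record n, read positions n-1 .. n-w out of the stored
    values; out-of-range positions come back as None (missing)."""
    out = []
    for n in range(len(counts)):
        row = [counts[n - i] if n - i >= 0 else None
               for i in range(1, w + 1)]
        out.append(row)              # [lag1, lag2, ..., lagW]
    return out

# NameA group from the sample data: counts 12, 4, 3, 1
print(lags_from_list([12, 4, 3, 1]))
# [[None, None, None], [12, None, None], [4, 12, None], [3, 4, 12]]
```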
Note that by the nature of the algorithm, the array doesn't need to be reinitialized before each BY group, because every new BY group just overwrites the requisite number of array items. However, the need to pre-process the input file to size up the array is a bit off-putting. That can be fixed by merely setting the upper bound to something like 1000000 - but that means assuming that no BY group is larger. A better solution is to use the hash object, as it requires no assumptions about the data as long as memory is plentiful enough for the largest BY group (though the code gets a little more verbose than with the array):
%let w = 3 ;
data want (drop = _:) ;
  if _n_ = 1 then do ;
    dcl hash h () ;
    h.definekey ("_n_") ;
    h.definedata ("_c") ;
    h.definedone () ;
  end ;
  do _n_ = 1 by 1 until (last.name) ;
    set have (rename=count=_c) ;
    by sn name ;
    count = _c ;
    h.replace() ;
    array v lag1-lag&w ;
    do over v ;
      _c = . ;
      _iorc_ = h.find (key:_n_-_i_) ;
      v = _c ;
    end ;
    output ;
  end ;
run ;
Here again, because the algorithm essentially emulates the array approach by calling the REPLACE method rather than ADD, there's no need to clear the hash for each BY group. If ADD were used, the CLEAR method would have to be called before (or after) the DOW loop. One technical subtlety is that when (_n_ - _i_) < 1, the FIND method fails and _C remains missing, which is exactly what we want. A little extra efficiency can be gained by calling FIND only when (_n_ - _i_) > 0.
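In Python terms, the hash object plays the role of a dictionary keyed by record position; here is a minimal sketch of the same REPLACE/FIND pattern (illustrative names and data of my own, not the SAS code translated):

```python
W = 3

def lags_via_dict(counts, w=W):
    """Store each value under its within-group position (REPLACE),
    then look lags up by key (FIND); .get() returns None (missing)
    when the lag reaches before the start of the group."""
    h, out = {}, []
    for n, c in enumerate(counts, start=1):
        h[n] = c                     # REPLACE: overwrite key n per group
        out.append([h.get(n - i) for i in range(1, w + 1)])
    return out

print(lags_via_dict([12, 4, 3, 1]))
# [[None, None, None], [12, None, None], [4, 12, None], [3, 4, 12]]
```

As with the SAS hash, the dictionary never needs clearing between groups, because each group overwrites the keys it reads.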
Kind regards
Paul D.