BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
crawfe
Quartz | Level 8

Hi. I am stuck on a lag problem. Doing a Previous 6 month cum_sum for each unique SN-Name combination, I want to reset all the lag counts to zero after each new S/N-new Name combination to begin summing the running cum again. The code does that for First.SN/First.Name but the resets to 0 seem to be “temporary”; after the first.Name, the lag values from three rows above seem to continue on where they left off?? (red). A partial OUT table is shown below from a big input table. Is that enough to let you know what is wrong? Thanks.

(This pattern repeats).

 

/* calculate previous 6 months running sum */

data WANT; /* new table name */

 

DO UNTIL (Last.SN);

set HAVE;

 by SN  Name ;

      lag1=lag1(count);lag2=lag2(count); lag3=lag3(count);

     lag4=lag4(count); lag5=lag5(count); lag6=lag6(count);

      

        IF first.Name  THEN DO

          cum_sum=count; lag1=0;lag2=0;lag3=0;lag4=0;lag5=0;lag6=0;

                END;

               

    ELSE DO cum_sum=sum(count,of lag1-lag6);

  END;

   output;

end;

/* drop lag4-lag6;*/ /*drop cum_sum;*/

run;

 

 

SN             Name                     Date       Count       Lag1         Lag2           Lag3         Lag4         Lag5            Lag6        cum_sum

11075652  NameA                 03/2019     12              9                4                 6                7                3                  1                42

11075652  NameA                 04/2019     4                12              9                4                6                7                  3                45

11075652  NameA                 05/2019     3                4               12               9                4                6                  7                45

11075652  NameA                 06/2019     1                3                4               12               9                4                  6                39

11075652  NameB                 05/2019     1                0                0                0                0                0                  0                1

11075682  NameA                 07/2019     1                0                0                0                0                0                  0                1

11075682  NameC                 05/2018     2                0                0                0                0                0                  0                2

11075682  NameC                 06/2018     2                2                1                1                 1                3                 4               14

11075682  NameC                 07/2018     2                2                2                1                1                1                  3                12

11075682  NameC                 08/2018     0                2                2                2                1                1                  1                9

11075682  NameC                 09/2018     2                0                2                2                2                1                  1                10

1 ACCEPTED SOLUTION

Accepted Solutions
crawfe
Quartz | Level 8

Several of them did work exactly as desired. Thanks to all. As many of us are, I am

in a position of having to learn and produce at the same time. This really helps to see different

approaches. Thanks!

View solution in original post

15 REPLIES 15
PaigeMiller
Diamond | Level 26

These 6 motnhs running cumulative sums would be easy to create in PROC EXPAND, and then there would be no issue with resetting the lagged variables.

--
Paige Miller
PeterClemmensen
Tourmaline | Level 20

@crawfe do you have SAS/ETS license?

noling
SAS Employee

Looks to me like your code is doing exactly what you're telling it to do.

 

Do you expect the second "11075682, NameC" row to take the previous values? it's no longer first.Name there, so it won't reset. What do you expect? Do you expect the second "11075682, NameC" row to have only values from the first "11075682, NameC" row or the previous 6 records, regardless of SN and Name?


Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

crawfe
Quartz | Level 8

sorry it wrapped....I wanted a new running list to start with 11075682/Name C.

First row in this group should have all the lags reset to 0 then start summing again from zero,

not reaching back to get lag values from a previous group (3, 4).

 

I don't have the ETS...

Tom
Super User Tom
Super User

Since you already have a DO loop around your SET why bother with LAG() function at all?

data WANT; 
do until(last.SN);
  set HAVE;
  by SN Name ;
  cum_sum=sum(count, of lag1-lag6);
  output;
  lag1=count;
  lag2=lag1;
  lag3=lag2;
  lag4=lag3;
  lag5=lag4;
  lag6=lag5;
end;
run;
crawfe
Quartz | Level 8

...i don't understand, you say don't bother with Lag then you include Lag in your code?

crawfe
Quartz | Level 8

Tom, your code sets all the Lags to 1 (=count(1st row)) even for SN1/Name1 row 2, which needs to have lags2-6 = 0

                                          Count Cum_sum Lag1 Lag 2 etc.

SN1         Name1 11/2017   1     1 . . . . . .
SN1        Name1 12/2017   1     7 1 1 1 1 1 1
SN1        Name1 01/2018   0     6 1 1 1 1 1 1
SN1        Name1 02/2018   0     0 0 0 0 0 0 0

Tom
Super User Tom
Super User

@crawfe wrote:

...i don't understand, you say don't bother with Lag then you include Lag in your code?


There is no LAG() function calls.

Note if you want code that works without debugging you need to post working sample data.

data have ;
  input SN $ Name $ Date Count ;
  informat date anydtdte.;
  format date yymm7.;
cards;
11075652  NameA     03/2019  12  
11075652  NameA     04/2019  4   
11075652  NameA     05/2019  3   
11075652  NameA     06/2019  1   
11075652  NameB     05/2019  1   
11075682  NameA     07/2019  1   
11075682  NameC     05/2018  2   
11075682  NameC     06/2018  2   
11075682  NameC     07/2018  2   
11075682  NameC     08/2018  0   
11075682  NameC     09/2018  2   
;

data WANT; 
do until(last.SN);
  set HAVE;
  by SN Name ;
  cum_sum=sum(count, of lag1-lag6);
  output;
  lag6=lag5;
  lag5=lag4;
  lag4=lag3;
  lag3=lag2;
  lag2=lag1;
  lag1=count;
end;
run;
Obs       SN       Name        Date    Count    cum_sum    lag1    lag2    lag3    lag4    lag5    lag6

  1    11075652    NameA    2019M03      12        12        .       .       .       .       .       .
  2    11075652    NameA    2019M04       4        16       12       .       .       .       .       .
  3    11075652    NameA    2019M05       3        19        4      12       .       .       .       .
  4    11075652    NameA    2019M06       1        20        3       4      12       .       .       .
  5    11075652    NameB    2019M05       1        21        1       3       4      12       .       .
  6    11075682    NameA    2019M07       1         1        .       .       .       .       .       .
  7    11075682    NameC    2018M05       2         3        1       .       .       .       .       .
  8    11075682    NameC    2018M06       2         5        2       1       .       .       .       .
  9    11075682    NameC    2018M07       2         7        2       2       1       .       .       .
 10    11075682    NameC    2018M08       0         7        2       2       2       1       .       .
 11    11075682    NameC    2018M09       2         9        0       2       2       2       1       .
hashman
Ammonite | Level 13

@Tom:

Agreed in more than one sense (see below). I answered the OP question about clearing the lags since this is what the OP had asked for; and I don't think that for problems of this kind using the LAGn function, let alone using multiple queues, is called for.

 

As to your code, note that though it doesn't use the LAGn function explicitly, it essentially emulates the working of LAG6 by moving all the elements of an artificial queue LAG1-LAG6 upward for every record - which is what LAG6 does behind the scenes (but about an order of magnitude more efficiently - I've tested). The advantage of using your artificial queue is that all of its items are accessible to the PDV, while with LAG6, the only item available is at the head of the queue. (That must be why the OP deemed the use of all the functions LAG1 through LAG6 necessary.)

 

Having said that, your code involves a lot of hard coding - the more, the wider the rolling sum window is. Therefore, the same concept can be expressed terser using an array, e.g. (I've set W=3 to allow at least one BY group be larger than W - otherwise we don't have a good test case):

data have ;                       
  input SN $ Name $ Date Count ;  
  informat date anydtdte. ;       
  format date yymm7. ;            
  cards ;                         
11075652  NameA  03/2019  12      
11075652  NameA  04/2019   4      
11075652  NameA  05/2019   3      
11075652  NameA  06/2019   1      
11075652  NameB  05/2019   1      
11075682  NameA  07/2019   1      
11075682  NameC  05/2018   2      
11075682  NameC  06/2018   2      
11075682  NameC  07/2018   2      
11075682  NameC  08/2018   0      
11075682  NameC  09/2018   2      
;                                 
run ;                             
                                  
%let w = 3 ;                      
                                  
data want ;                       
  do until (last.name) ;          
    set have ;                    
    by sn name ;                  
    array cl count lag1-lag&w ;   
    cum_sum = sum (of cl[*]) ;    
    output ;                      
    do _i_ = dim (cl) to 2 by -1 ;
      cl [_i_] = cl [_i_-1] ;     
    end ;                         
  end ;                           
run ;                             

Now, as I've repeatedly stated in the past (including in a similar thread, methinks originated by the same OP in the same vein), the whole idea of computing a rolling sum in a window of size W by calculating all the lags from 1 to W and summing them up is totally misguided. The reason is that to compute a rolling sum, we only need to add the leading item and subtract the item W+1 records back in the file. This way, computing a rolling sum for W=3 and W=100000 is equally efficient, while summing up 3 lag items vs 100000 lag items - not to mention getting them all - is all but. In this case, the leading item is current COUNT, so all we need is the value of COUNT W+1 records back in the file - and that only in the case when the size of the BY group exceeds W+1. That item can be fetched using the LAGw function, e.g.:

data want ;                                   
  do until (last.name) ;                      
    set have ;                                
    by sn name ;                              
    link lag ;                                
    cum_sum = sum (cum_sum, count, - _sub) ;  
    output ;                                  
  end ;                                       
  count = 0 ;                                 
  do _n_ = 1 to &w + 1 ;                      
    link lag ;                                
  end ;                                       
  return ;                                    
  lag: _sub = sum (lag%eval(&w+1)(count), 0) ;
  return ;                                    
run ;                                         

However, again, using the LAGw function here, even just one, is far from optimal since its internal queue needs to be cleared before/after each BY group, which for large W means a lot of overhead. It's much simpler and less onerous to fetch the item W+1 back by using a direct access SET, e.g.:

data want (drop = _:) ;                                
  do _n_ = 1 by 1 until (last.name) ;                  
    set have curobs = q ;                              
    by sn name ;                                       
    if _n_ > &w + 1 then do ;                          
      p = q - (&w + 1) ;                               
      set have (keep=count rename=count=_c) point = p ;
    end ;                                              
    cum_sum = sum (cum_sum, count, - sum (_c, 0)) ;    
    output ;                                           
  end ;                                                
run ;                                                  

In principle, there can be situations where we need all the lag values (though I can hardly fancy what for - surely not for computing a rolling sum, as shown above). In this case, using multiple LAG function queues is utterly misguided, too, because a simply array can do much more efficiently:

%let w = 3 ;                                                                    
                                                                                
proc sql noprint ;                                                              
  select max (q) into :q from (select count (*) as q from have group sn, name) ;
quit ;                                                                          
                                                                                
data want ;                                                                     
  array q [-&w:&q] _temporary_ ;                                                
  do _n_ = 1 by 1 until (last.name) ;                                           
    set have ;                                                                  
    by sn name ;                                                                
    q[_n_] = count ;                                                            
    array ll lag1-lag&w ;                                                       
    do over ll ;                                                                
      ll = q[_n_-_i_] ;                                                         
    end ;                                                                       
    output ;                                                                    
  end ;                                                                         
run ;                                                                           

Note that by the nature of the algorithm, the array doesn't need to be reinitialized before each BY group because every new BY group just overwrites the requisite number of array items. However, the need to pre-process the input file to size up the array is a bit off-putting. That can be fixed by merely setting the upper bound to something like 1000000 - but that means making the assumption that no by-group is larger. A better solution is using the hash object, as it doesn't require making any assumptions about data as long as memory is plentiful enough for the largest BY-group (though the code gets a little bit more verbose than with the array):

%let w = 3 ;                         
                                     
data want (drop = _:) ;              
  if _n_ = 1 then do ;               
    dcl hash h () ;                  
    h.definekey  ("_n_") ;           
    h.definedata ("_c") ;            
    h.definedone () ;                
  end ;                              
  do _n_ = 1 by 1 until (last.name) ;
    set have (rename=count=_c) ;     
    by sn name ;                     
    count = _c ;                     
    h.replace() ;                    
    array v lag1-lag&w ;             
    do over v ;                      
      _c = . ;                       
      _iorc_ = h.find (key:_n_-_i_) ;
      v = _c ;                       
    end ;                            
    output ;                         
  end ;                              
run ;                                

Here again, by the nature of the algorithm, essentially emulating the array approach by calling the REPLACE method rather than ADD, there's no need to clear the hash for each BY group. If ADD were used, the CLEAR method would have to be called before (or after) the DOW loop. One technical subtlety is that if the FIND method fails when (_n_ - _i_)  < 1, _C remains missing, which is what we want. A little bit of efficiency can be added by calling FIND only when (_n_ - _i_)  > 0.

 

Kind regards

Paul D.          

hashman
Ammonite | Level 13

@crawfe:

When you use the LAGn function, you need to understand its nature. To recap:

-- LAGn is a queue of N items in memory occupying <item length>*N bytes. If the queue is numeric, <item length> = 8.  

-- Every time LAGn is called for the same allocated queue, the item in the front of the queue is ejected (called dequeueing), and the value of the argument enters the rear of the queue (called enqueueing), displacing the rest of the items 1 position towards the front. 

-- "For the same allocated queue" means that each time the compiler sees another reference to LAGn, it allocates a separate queue with N items. Thus, 

 

x = lag (x) ;

x = lag (x) ;

 

is not at all the same as:

 

do i = 1 to 2 ;

  x = lag (x) ;

end ;

 

This is because in the former case, the compiler has seen 2 LAG references and organized 2 separate, completely independent, queues. Thus, the first x=lag(x) causes the dequeing and enqueueing only in the first LAG queue, and the second x=lag(x) does the same only for the second queue. In the latter case, the compiler sees only one LAG reference and therefore organizes a single LAG queue, so that each time the DO loop iterates, the dequeing and enqueueing occur in the same, single, LAG queue.   

-- Thus, a LAGn queue cannot be "cleared up" by doing anything with the variables, to which a call to the LAGn function assigns the dequeued value. It can be cleared up only by calling LAGn for the same queue (i.e. in a loop) N times, so that the item currently in the back of the queue is moved forward to the front of it and gets dequeued. The LAGn argument used in this action should be given the value to which you want the queue reinitialized - for example, a missing value or zero. This way, when after that you call LAGn again to create your assigned lag values, the first item dequeued will be that value.

 

After these prelim notes, it should be clear how to clear your queues before each BY group:

data have ;                              
  input sn count ;                       
  cards ;                                
11075652  12                             
11075652   4                             
11075652   3                             
11075652   1                             
11075652   1                             
11075682   1                             
11075682   2                             
11075682   2                             
11075682   2                             
11075682   0                             
11075682   2                             
run ;                                    
                                         
data want ;                              
  if 0 then set have ; /*keep the original variable order*/                   
  count = 0 ;          /*you want to initialize LAGs to 0*/                   
  do _n_ = 1 to 6 ;    /*loop 6 times, as LAG6 is longest*/                  
    link lag ;         /*use LINK to have the compiler see each LAGn just once*/                  
  end ;                                  
  do until (last.sn) ;                   
    set have ;                           
    by sn ;                              
    link lag ;         /*use LINK to have the compiler see each LAGn just once*/                           
    cum_sum = sum (count, of lag1-lag6) ;
    output ;                             
  end ;                                  
  return ;                               
    lag: lag1 = lag1 (count) ;           
         lag2 = lag2 (count) ;           
         lag3 = lag3 (count) ;           
         lag4 = lag4 (count) ;           
         lag5 = lag5 (count) ;           
         lag6 = lag6 (count) ;           
  return ;                               
run ;                                    

This way, the LAGn variables will remain 0 for the first N records in each BY group, as they should. Alternatively, you can reinitialize the queues with missing values - it won't affect CUM_SUM, but  in this case, the LAGn variables will remain missing for the first N records in each BY group, and coding will be a bit terser:

data want ;                              
  do until (last.sn) ;                   
    set have ;                           
    by sn ;                              
    link lag ;                           
    cum_sum = sum (count, of lag1-lag6) ;
    output ;                             
  end ;                                  
  count = . ;                            
  do _n_ = 1 to 6 ;                      
    link lag ;                           
  end ;                                  
  return ;                               
    lag: lag1 = lag1 (count) ;           
         lag2 = lag2 (count) ;           
         lag3 = lag3 (count) ;           
         lag4 = lag4 (count) ;           
         lag5 = lag5 (count) ;           
         lag6 = lag6 (count) ;           
  return ;                               
run ;                                    

Kind regards

Paul D.         

crawfe
Quartz | Level 8

Thanks! That is a Lag description I wouldn't find anywhere else!

 

As asked for above, here is a data set that is typical for my input table.

Three levels of variables/fields (DateMY is in mmyys7.).

The lag sums need to be calculated for each SN-Name combination, increasing by date,

SO after 101-Name1 lag sums are complete, the lag1-6 queue needs to go back to 0 for lag(n). Then back to zero

after 101-Name2, 201-Name3, etc. The data would be sorted as shown.

 

SN  Name  DateMY  Count  (WANT-Lag cum_sum)

101 Name1 11/2017 12               12
101 Name1 12/2017 4                 16
101 Name1 02/2018 3                 19
101 Name1 04/2018 3                 22
101 Name1 05/2018 7                 29
101 Name1 07/2018 8                 37
101 Name2 03/2018 1                 1
101 Name2 04/2018 1                 2
201 Name3 03/2018 5                 5
201 Name3 05/2018 6                11
202 Name1 01/2018 1                 1
202 Name1 03/2018 2                 3
202 Name1 05/2018 7                10
202 Name2 02/2018 2                 2
202 Name2 03/2018 0                 2
202 Name2 05/2018 2                 4

 

Tom
Super User Tom
Super User

If you want to reset the sum when the SN plus NAME combination changes then make sure to setup the code to do it that way and not just when SN changes.

do until(last.name);
  set have;
  by sn name;
  ...
noling
SAS Employee

I try to avoid using the lag function due to the complexities that @hashman describes. Here's an alternative method using a RETAIN statement:

 

data have ;
  input SN $ Name $ Date Count ;
  informat date anydtdte.;
  format date yymm7.;
cards;
11075652  NameA     03/2019  12  
11075652  NameA     04/2019  4   
11075652  NameA     05/2019  3   
11075652  NameA     06/2019  1   
11075652  NameB     05/2019  1   
11075682  NameA     07/2019  1   
11075682  NameC     05/2018  2   
11075682  NameC     06/2018  2   
11075682  NameC     07/2018  2   
11075682  NameC     08/2018  0   
11075682  NameC     09/2018  2   
;
run;

data want;
	set have;
	retain lag1-lag6 count_retain 0;
	by SN Name;

	*clear retained values for each new group;
	if first.name then do;
		lag1=0; lag2=0; lag3=0; lag4=0; lag5=0; lag6=0; count_retain=0;
	end;
	*pass down retained values;
	lag6=lag5;
	lag5=lag4;
	lag4=lag3;
	lag3=lag2;
	lag2=lag1;
	lag1=count_retain;
	count_retain=count;

	cum_sum=sum(count,of lag1-lag6);
	drop count_retain;
run;

Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

Astounding
PROC Star

Among all the suggestions you have received, there must be one that does what you want.  Just in case you are not yet convinced of that, here is how I would approach the problem:

 

data want;
   back6 = back5;
   back5 = back4;
   back4 = back3;
   back3 = back2;
   back2 = back1;
   back1 = count;
   retain back1-back5;
set have; by SN name; if first.name then call missing(of back1-back6); cum_sum = sum(count, of back1-back6); run;

 

When using the SUM function, there is no need to re-set values to zero.  It's easier to re-set them to missing instead. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1437 views
  • 3 likes
  • 7 in conversation