Re: Clear the LAG queue

craigsrsen · Posted 08-08-2018 04:38 PM

I have a dataset that I am processing in BY-groups and am using multiple LAG functions, e.g. a LAG1 variable, a LAG2 variable, etc.

When I reach a new BY group, I want to clear the entire LAG queue from memory and start over. If I use FIRST. I can clear lagged values from the first record in a BY group, but on the second record of the BY group I am still pulling in lagged values from the previous BY group.

I could probably devise a work-around but I am wondering if there is a simple command to clear the entire LAG queue.

Thanks

ballardw · Posted 08-08-2018 05:38 PM

@craigsrsen wrote:

I have a dataset that I am processing in BY-groups and am using multiple LAG functions, e.g. a LAG1 variable, a LAG2 variable, etc.

When I reach a new BY group, I want to clear the entire LAG queue from memory and start over. If I use FIRST. I can clear lagged values from the first record in a BY group, but on the second record of the BY group I am still pulling in lagged values from the previous BY group.

I could probably devise a work-around but I am wondering if there is a simple command to clear the entire LAG queue.

Thanks

No simple way to clear the lag queue entirely.

If you show the code you are currently attempting we can make suggestions that may work in this case.

Likely it is going to involve a counter within the by group and conditional assignment to a different variable than a simple lagn()

hashman · Posted 08-08-2018 06:01 PM

@craigsrsen:

It cannot be done after the BY group, and here's why. Suppose you have data as below and attempt to clean the queues after each BY group by repeatedly dequeueing in a loop up to the length of the longest lag:

data test ;            
  input id v ;         
  cards ;              
2 1                    
2 2                    
3 1                    
3 2                    
3 3                    
;                      
run ;                  
                       
data _null_ ;          
  do until (last.id) ; 
    set test ;         
    by id ;            
    x1 = lag1 (v) ;    
    x2 = lag2 (v) ;    
    put _n_= x1= x2= ; 
  end ;                
  do _n_ = 1 to 2 ;    
    x = lag1 (.) ;     
    x = lag2 (.) ;     
  end ;                
run ;

What you'll see in the log will be (note that _N_ below is the BY group number):

_N_=1 x1=. x2=.
_N_=1 x1=1 x2=.
_N_=2 x1=2 x2=1
_N_=2 x1=1 x2=2
_N_=2 x1=2 x2=1

Obviously, nothing has been cleaned up. The reason is that when the compiler sees LAG1 and LAG2 for the second time, it allocates totally new queues having nothing whatsoever to do with the queues referenced in the BY group DO loop, and so the "clean-up" DO loop executes against the second set of the LAGs. The only two ways I see to work around it are:

1. Organize your queues using arrays or hash tables (it's actually pretty easy).

2. To the end of each BY group, add as many dummy records as the length of the longest lag. This way, the compiler will see the lags only once but the dummy records will dequeue them before the next BY group appears.

WRT #2, for example:

%let maxlag = 2 ; * longest lag ;        
                                         
data dummy / view = dummy ;              
  set test ;                             
  by id ;                                
  output ;                               
  v = . ;                                
  if last.id then do _n_ = 1 to &maxlag ;
    output ;                             
  end ;                                  
run ;                                    
                                         
data _null_ ;                            
  do until (last.id) ;                   
    set dummy ;                          
    by id ;                              
    x1 = lag1 (v) ;                      
    x2 = lag2 (v) ;                      
    put _n_= x1= x2= ;                   
  end ;                                  
run ;

In this case, you'll see in the log:

_N_=1 x1=. x2=.
_N_=1 x1=1 x2=.
_N_=1 x1=2 x2=1
_N_=1 x1=. x2=2
_N_=2 x1=. x2=.
_N_=2 x1=1 x2=.
_N_=2 x1=2 x2=1
_N_=2 x1=3 x2=2
_N_=2 x1=. x2=3

as you'd expect since the lags have been dequeued by calling them for each dummy record. In your program, you'll have to include some conditional statement to separate your processing of the real records from the dummy ones - just make sure it doesn't include any reference to LAG1 or LAG2 ;).

Paul D.

ChrisNZ · Posted 08-08-2018 08:11 PM

This is probably how I would implement a proper-queue logic:

proc sort data=SASHELP.CLASS out=CLASS; 
  by SEX NAME;
run;
data CLASS2;
 set CLASS;
  by SEX;
  N1=lag1(NAME) ;
  N2=lag2(NAME) ;
  if SEX ne lag1(SEX) then N1=' ';
  if SEX ne lag2(SEX) then N2=' ';
run;

It's easy to read and to debug.

High-Performance SAS Coding - Third Edition

craigsrsen · Posted 08-10-2018 05:04 PM

Thanks Chris. This is basically the solution I ended up going with. It was pretty straightforward.

Tom · Posted 08-08-2018 08:37 PM

You can just use RETAIN instead of LAG(). Then it is easy to just use CALL MISSING();

data want ;
   lag5=lag4;
   lag4=lag3;
   lag3=lag2;
   lag2=lag1;
   lag1=x;
   retain lag1-lag5 ;
   set have ;
   by id;
   if first.id then call missing(of lag1-lag5);

   .... do stuff  ...

run;

hashman · Posted 08-08-2018 11:26 PM

@Tom:

Exactly. You're essentially organizing a queue as an expanded (unrolled) array. As such, it can be coded as an unexpanded array as well; and:

- if it is non-temporary, CALL MISSING is not needed

- and if the DoW loop is used, RETAIN isn't needed, either

E.g.:

data have ;                                               
  do ID = "A", "B" ;                                      
    do x = 1 to 9 ;                                       
      output ;                                            
    end ;                                                 
  end ;                                                   
run ;                                                     
                                                          
data want (drop = _:) ;                                   
  do until (last.id) ;                                    
    set have ;                                            
    by ID ;                                               
    array xq [*] x _q1-_q5 ;                       
    do _i = dim (xq) to 2 by -1 ;                      
      xq [_i] = xq [_i - 1] ;                       
    end ;                                                 
    put ID= x= " q: " _q1-_q5 ;
    /* do stuff */                                        
  end ;                                                   
run ;

And the log will show:

ID=A x=1 q: 1 . . . .
ID=A x=2 q: 2 1 . . .
ID=A x=3 q: 3 2 1 . .
ID=A x=4 q: 4 3 2 1 .
ID=A x=5 q: 5 4 3 2 1
ID=A x=6 q: 6 5 4 3 2
ID=A x=7 q: 7 6 5 4 3
ID=A x=8 q: 8 7 6 5 4
ID=A x=9 q: 9 8 7 6 5
ID=B x=1 q: 1 . . . .
ID=B x=2 q: 2 1 . . .
ID=B x=3 q: 3 2 1 . .
ID=B x=4 q: 4 3 2 1 .
ID=B x=5 q: 5 4 3 2 1
ID=B x=6 q: 6 5 4 3 2
ID=B x=7 q: 7 6 5 4 3
ID=B x=8 q: 8 7 6 5 4
ID=B x=9 q: 9 8 7 6 5

Quod erat faciendum.

One little issue with this way of organizing a queue is the need to shift all the q-values one position backward for each record read from the file. However, that can be easily recoded by either (a) allocating a "big enough" array and merely adjusting index pointers instead of moving the values or (b) using a hash table with its advantage of dynamic sizing. (In fact, there's a section on organizing a hash queue in Chapter 10 of the book "Data Management Solutions Using SAS Hash Table Operations" just published by SAS Press and penned by Don Henderson and yours truly.)

Best

Paul D.

Classroom Training Available!