Solved: How to reset the cum_sum counter in a Lag procedure

crawfe · Posted 07-26-2019 03:36 PM

Hi. I couldn't think of a good title!

I combined a fill-in all dates procedure and then I sum the previous 3 months values.

How can I get the cum_sum to start fresh at each new Name? Thanks!

data HAVE;
input NAME $ DATE : anydtdte10. VALUE ;
format date mmyys7.;
cards;
Name1 11/2018 4
Name1 01/2019 3
Name1 04/2019 2
Name1 10/2019 1
Name1 11/2019 3
Name2 11/2018 4
Name2 01/2019 3
Name2 04/2019 2
Name2 05/2019 1
Name2 07/2019 5
;

data want; /* new table name */
merge have have(firstobs=2 drop=value rename=(name=_name date=_date));
if name=_name then do while(date<_date);
output;
date=intnx('month',date,1);
value=0;
end;
else output;
drop _:;
run;

/* 3 month previous sum */

data ThreeMoSum1;
set want;
lag1=lag1(value);
lag2=lag2(value);
lag3=lag3(value);
/* sum the three previous values for this month's date */
cum_sum=sum(value,of lag: ) - value;
drop lag:;
run;

OUTPUT: Name Date Value cum_sum

Name1 11/2018 4 0
Name1 12/2018 0 4
Name1 01/2019 3 4
Name1 02/2019 0 7
Name1 03/2019 0 3
Name1 04/2019 2 3
Name1 05/2019 0 2
Name1 06/2019 0 2
Name1 07/2019 0 2
Name1 08/2019 0 0
Name1 09/2019 0 0
Name1 10/2019 1 0
Name1 11/2019 3 1
Name2 11/2018 4 4
Name2 12/2018 0 8 should be 4 =4
Name2 01/2019 3 7 should be 4 + 0 = 4
Name2 02/2019 0 7 ok 4 + 0 + 3
Name2 03/2019 0 3 ok 0 + 3 + 0
Name2 04/2019 2 3
Name2 05/2019 1 2
Name2 06/2019 0 3
Name2 07/2019 5 3

crawfe · Posted 07-27-2019 11:35 AM

Thanks! It's interesting to see different approaches. I tend to go for more complicated solutions

to simple ideas.

View solution in original post

PaigeMiller · Posted 07-26-2019 05:01 PM

data threemosum1;
    set want;
    by name;
    lag1=lag1(value);
    lag2=lag2(value);
    lag3=lag3(value);
    if first.name then do;
        lag1=0;
        lag2=0;
        lag3=0;
    end;

--
Paige Miller

Tom · Posted 07-26-2019 06:28 PM

You need to reset the lagged variables.

You can avoid needing to reset them by using a DO loop around the SET statement just making the lagged values by using simple assignment statements instead of LAG() function.

/* 3 month previous sum */

data ThreeMoSum1;
  do until(last.name);
    set want ;
    by name;
    cum_sum=sum(0,of lag1-lag3);
    output;
    lag3=lag2;
    lag2=lag1;
    lag1=value;
  end;
run;

crawfe · Posted 07-27-2019 11:35 AM

Thanks! It's interesting to see different approaches. I tend to go for more complicated solutions

to simple ideas.

hashman · Posted 07-27-2019 01:57 AM

@crawfe:

I don't know what your "fill-in all dates procedure" is. However, before I offer you code, let me make a couple of notes of algorithmic nature:

Any "fill-in" problem of the sort you've presented here is best addressed using key-indexing of some kind. Here, key-indexing into a temp array is used. Not only it makes the solution logically simpler than dealing with some looking backward or forward logic, it rids of the need to use lags to boot.
Any method of computing a rolling sum within a window of width W that relies on summing up all items in the window is inefficient and wastes resources unless W<=2. This is because all you need to do to compute the rolling sum is to add the leading value and subtract the value W items behind; in other words, it requires no more than 2 sum operations regardless of W. Fancy the difference between the mere 2 acts of adding and subtracting and the need to sum up all the items in the window of W=100000, say.

data have ;                                                               
  input name $ date :anydtdte7. value ;                                   
  format date yymms7.;                                                    
cards;                                                                    
NAME1 11/2018 4                                                           
NAME1 01/2019 3                                                           
NAME1 04/2019 2                                                           
NAME1 10/2019 1                                                           
NAME1 11/2019 3                                                           
NAME2 11/2018 4                                                           
NAME2 01/2019 3                                                           
NAME2 04/2019 2                                                           
NAME2 05/2019 1                                                           
NAME2 07/2019 5                                                           
run ;                                                                     
                                                                          
data want (drop = _:) ;                                                   
  retain _d0 "01jan1582"d _u "mon" _p 1 _w 3 ;                            
  array vv [99999] _temporary_ ;                                          
  do until (last.name) ;                                                  
    set have ;                                                            
    by name ;                                                             
    if first.name then _sd = date ;                                       
    vv[intck (_u, _d0, date)] = value ;                                   
  end ;                                                                   
  do _m = intck (_u, _d0, _sd) to intck (_u, _d0, date) ;                 
    date = intnx (_u, _d0, _m) ;                                          
    value = sum (vv[_m], 0) ;                                             
    cum_sum = sum (cum_sum, vv[_m - _p], - sum (vv[_m - _p - _w], 0), 0) ;
    output ;                                                              
  end ;                                                                   
  call missing (of vv[*]) ;                                               
run ;

Program notes:

In the RETAIN statement, _P=1 is the number of records prior to the current from which you need to start the summation; and _W=3 is the rolling sum window width. The rest are simply constants to simplify the code: _D0="01jan1582"d is the earliest possible SAS date, and _U="mon" is the time interval unit for the functions INTCK and INTNX.
The zeroes plugged into the SUM function calls prevent the program from generating missing values and the corresponding log notes.
99999 is a "big" number number of months since _D0 guaranteed to reliably exceed any reasonable SAS date any program might need to handle (it corresponds to 9915/04).
It is assumed that the input file is sorted by (name,date).

Kind regards

Paul D.

Tom · Posted 07-27-2019 10:29 AM

Actually it only assumes (or needs) that the input data is sorted by NAME. The output is sorted by NAME and DATE.

hashman · Posted 07-27-2019 01:53 PM

@Tom:

I said "by (NAME, DATE)" because my code relies on the DATE order within NAME to get the min date _SD at the first record and max - at the last. Foregoing this assumption would mean just 2 extra lines of code and initializing _SD to something big before the DOW loop. Foregoing the assumption that the input is sorted at all and still attaining the goal in a single step is also possible but the code would look quite differently and get much more involved.

Kind regards

Paul D.

Tom · Posted 07-27-2019 02:29 PM

In that case add DATE to the BY statement so that the requirement is clearer and so that SAS will stop the step if the data violates the sort criteria.

hashman · Posted 07-27-2019 02:51 PM

@Tom:

"... add DATE to the BY statement so that the requirement is clearer ..."

I deem this reason insufficient to increase the overhead of unused BY variables.

"... and so that SAS will stop the step if the data violates the sort criteria."

However, this is a sage advice, and your point is well taken.

novinosrin · Posted 07-27-2019 05:48 PM

@hashman Guru, Always a privilege to direct your all your notes to my personal folder and you know I sincerely follow your notes although that shrunk like that of dynamic hash object unlike your usual style. 🙂 Anyways nice to keep it short, so i can research more and learn rather than being spoon fed.

OK, Can you clarify a couple of my "fill-in" concerns

1. Comparing with Look ahead, the Key based array search, find the min, max and loop. Requires one pass of the By group and the loop the min to max interval. Cool and neat. May we(on behalf of community) get your insight to performance considering it seems "more" than one pass. Am i missing something?

2. Much of the literature claims temp array is faster than a hash approach, Is it always the case? Nonetheless, syntactically I find Hash easier(thanks to your "Book") for the reason needless to size up the array beforehand as it dynamically grows and shrinks and also convenient to visualize the pointer's move on a method call which is exquisitely described in the book. Well i suppose an extra pass of sizing is perhaps an overhead however not in this case(thread) of course as your initializing to 99999 is enormous.

3. So, loading the HAVE in hash with date as key, a DOW pass to find min,max and a mere loop from min to max with an (rc=h.find()=0*value) boolean seems a breeze than an array.

Well all the above being said, I am sure a Guru has his own reasons to why and what of the subject. I will look forward to your valuable time in receiving that privilege yet again.

Best Regards!

PS

For those who are interested in Hashman's book and other books. Safarionline annual subscription has discount prices at $199. Earlier it was $398

hashman · Posted 07-27-2019 10:31 PM

Expert @novinosrin:

1. Just a little bit more. Scanning a numeric array is about 5 times faster than reading a SAS data set with 1 numeric variable sequentially. Hence, when a BY group is first read in from disc and then post-processed using an array, the latter doesn't quite add what would amount to the second read.

2. If you mean sequential read (i.e. the enumeration operation), then yes, the temp arrays are quicker by quite a bit. Judge for yourself:

data _null_ ;                          
  dcl hash h() ;                       
  h.definekey ("k") ;                  
  h.definedone () ;                    
  dcl hiter hi ("h") ;                 
  do k = 1 to 1e6 ;                    
    h.add() ;                          
  end ;                                
  t = time() ;                         
  do r = 1 to 100 ;                    
    do while (hi.next() = 0) ;         
    end ;                              
  end ;                                
  hash_time = time() - t ;             
  array a [1000000] _temporary_ ;      
  t = time() ;                         
  do r = 1 to 100 ;                    
    do _n_ = 1 to dim (a) ;            
      k = a[_n_] ;                     
    end ;                              
  end ;                                
  arry_time = time() - t ;             
  put hash_time=z6.3 / arry_time=z6.3 ;
run ;

And the SAS log (which never lies) reports:

hash_time=15.105
arry_time=01.024

Remember, though, that the hash object's claim to fame is not sequential performance. It's quite obvious that traversing an AVL tree is a far more laborious task than scanning through an array (which requires nothing more than adding a fixed number to the address of the first array item). Though it's nice and convenient that SAS has included the enumeration operation with the hash object, its main strength lies is super-quick random access to data in memory by an arbitrary (i.e. simple or mixed type composite) key. Besides, the functionality of the hash object is way wider that that of an array, and matching even a fraction of it using arrays requires custom coding of the sort most SAS users are neither capable of nor should be asked to do - which is why the hash object was created in the first place.

On the other hand, for this particular type of task, a key-indexed temp array offers the advantage of algorithmic simplicity besides the speed. Note that by key-indexing the month numbers relative to a fixed point in the past (which never has to be changed, so using it makes no assumptions about the data), you automatically get all the month numbers in between as null-filled slots, and outputting them is no more complex than merely scanning through the array. If you stored the dates in a hash table instead, you'd still have to execute some kind of in-between logic. Of course, you could alternatively pre-fill the table with all months from min to max, then update it with the months actually available in the BY group and iterated through the table to get essentially what the array does. Surely, it has the advantage of not having to use the "big enough" type logic, but the latter is a small price to pay for a much simpler approach, especially since using 99999 in this case (heck, you could use 999999 if it should strike your fancy) is absolutely bulletproof.

3. Methinks I've already addressed that.

Kind regards

Paul D.

p.s. Thanks for the plug.

novinosrin · Posted 07-27-2019 10:57 PM

Guru, Classic and diligence at best. Thank you * 1e6 or never can thank enough for your time. While I have thoroughly comprehended the explanation, all I can possibly say is your detailed explanation is "sublime and elegant". I mean it. Simply Priceless!

Have a good night and take care!

How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

Re: How to reset the cum_sum counter in a Lag procedure

SAS Innovate 2025: Call for Content

Classroom Training Available!