Hi. I couldn't think of a good title!
I combined a fill-in all dates procedure and then I sum the previous 3 months values.
How can I get the cum_sum to start fresh at each new Name? Thanks!
data HAVE;
input NAME $ DATE : anydtdte10. VALUE ;
format date mmyys7.;
cards;
Name1 11/2018 4
Name1 01/2019 3
Name1 04/2019 2
Name1 10/2019 1
Name1 11/2019 3
Name2 11/2018 4
Name2 01/2019 3
Name2 04/2019 2
Name2 05/2019 1
Name2 07/2019 5
;
data want; /* new table name */
merge have have(firstobs=2 drop=value rename=(name=_name date=_date));
if name=_name then do while(date<_date);
output;
date=intnx('month',date,1);
value=0;
end;
else output;
drop _:;
run;
/* 3 month previous sum */
data ThreeMoSum1;
set want;
lag1=lag1(value);
lag2=lag2(value);
lag3=lag3(value);
/* sum the three previous values for this month's date */
cum_sum=sum(value,of lag: ) - value;
drop lag:;
run;
OUTPUT: Name Date Value cum_sum
Name1 11/2018 4 0
Name1 12/2018 0 4
Name1 01/2019 3 4
Name1 02/2019 0 7
Name1 03/2019 0 3
Name1 04/2019 2 3
Name1 05/2019 0 2
Name1 06/2019 0 2
Name1 07/2019 0 2
Name1 08/2019 0 0
Name1 09/2019 0 0
Name1 10/2019 1 0
Name1 11/2019 3 1
Name2 11/2018 4 4
Name2 12/2018 0 8 should be 4 =4
Name2 01/2019 3 7 should be 4 + 0 = 4
Name2 02/2019 0 7 ok 4 + 0 + 3
Name2 03/2019 0 3 ok 0 + 3 + 0
Name2 04/2019 2 3
Name2 05/2019 1 2
Name2 06/2019 0 3
Name2 07/2019 5 3
Thanks! It's interesting to see different approaches. I tend to go for more complicated solutions
to simple ideas.
data threemosum1;
set want;
by name;
lag1=lag1(value);
lag2=lag2(value);
lag3=lag3(value);
if first.name then do;
lag1=0;
lag2=0;
lag3=0;
end;
You need to reset the lagged variables.
You can avoid needing to reset them by using a DO loop around the SET statement just making the lagged values by using simple assignment statements instead of LAG() function.
/* 3 month previous sum */
data ThreeMoSum1;
do until(last.name);
set want ;
by name;
cum_sum=sum(0,of lag1-lag3);
output;
lag3=lag2;
lag2=lag1;
lag1=value;
end;
run;
Thanks! It's interesting to see different approaches. I tend to go for more complicated solutions
to simple ideas.
I don't know what your "fill-in all dates procedure" is. However, before I offer you code, let me make a couple of notes of algorithmic nature:
data have ;
input name $ date :anydtdte7. value ;
format date yymms7.;
cards;
NAME1 11/2018 4
NAME1 01/2019 3
NAME1 04/2019 2
NAME1 10/2019 1
NAME1 11/2019 3
NAME2 11/2018 4
NAME2 01/2019 3
NAME2 04/2019 2
NAME2 05/2019 1
NAME2 07/2019 5
run ;
data want (drop = _:) ;
retain _d0 "01jan1582"d _u "mon" _p 1 _w 3 ;
array vv [99999] _temporary_ ;
do until (last.name) ;
set have ;
by name ;
if first.name then _sd = date ;
vv[intck (_u, _d0, date)] = value ;
end ;
do _m = intck (_u, _d0, _sd) to intck (_u, _d0, date) ;
date = intnx (_u, _d0, _m) ;
value = sum (vv[_m], 0) ;
cum_sum = sum (cum_sum, vv[_m - _p], - sum (vv[_m - _p - _w], 0), 0) ;
output ;
end ;
call missing (of vv[*]) ;
run ;
Program notes:
Kind regards
Paul D.
@Tom:
I said "by (NAME, DATE)" because my code relies on the DATE order within NAME to get the min date _SD at the first record and max - at the last. Foregoing this assumption would mean just 2 extra lines of code and initializing _SD to something big before the DOW loop. Foregoing the assumption that the input is sorted at all and still attaining the goal in a single step is also possible but the code would look quite differently and get much more involved.
Kind regards
Paul D.
In that case add DATE to the BY statement so that the requirement is clearer and so that SAS will stop the step if the data violates the sort criteria.
@Tom:
"... add DATE to the BY statement so that the requirement is clearer ..."
I deem this reason insufficient to increase the overhead of unused BY variables.
"... and so that SAS will stop the step if the data violates the sort criteria."
However, this is a sage advice, and your point is well taken.
@hashman Guru, Always a privilege to direct your all your notes to my personal folder and you know I sincerely follow your notes although that shrunk like that of dynamic hash object unlike your usual style. 🙂 Anyways nice to keep it short, so i can research more and learn rather than being spoon fed.
OK, Can you clarify a couple of my "fill-in" concerns
1. Comparing with Look ahead, the Key based array search, find the min, max and loop. Requires one pass of the By group and the loop the min to max interval. Cool and neat. May we(on behalf of community) get your insight to performance considering it seems "more" than one pass. Am i missing something?
2. Much of the literature claims temp array is faster than a hash approach, Is it always the case? Nonetheless, syntactically I find Hash easier(thanks to your "Book") for the reason needless to size up the array beforehand as it dynamically grows and shrinks and also convenient to visualize the pointer's move on a method call which is exquisitely described in the book. Well i suppose an extra pass of sizing is perhaps an overhead however not in this case(thread) of course as your initializing to 99999 is enormous.
3. So, loading the HAVE in hash with date as key, a DOW pass to find min,max and a mere loop from min to max with an (rc=h.find()=0*value) boolean seems a breeze than an array.
Well all the above being said, I am sure a Guru has his own reasons to why and what of the subject. I will look forward to your valuable time in receiving that privilege yet again.
Best Regards!
PS
For those who are interested in Hashman's book and other books. Safarionline annual subscription has discount prices at $199. Earlier it was $398
Expert @novinosrin:
1. Just a little bit more. Scanning a numeric array is about 5 times faster than reading a SAS data set with 1 numeric variable sequentially. Hence, when a BY group is first read in from disc and then post-processed using an array, the latter doesn't quite add what would amount to the second read.
2. If you mean sequential read (i.e. the enumeration operation), then yes, the temp arrays are quicker by quite a bit. Judge for yourself:
data _null_ ;
dcl hash h() ;
h.definekey ("k") ;
h.definedone () ;
dcl hiter hi ("h") ;
do k = 1 to 1e6 ;
h.add() ;
end ;
t = time() ;
do r = 1 to 100 ;
do while (hi.next() = 0) ;
end ;
end ;
hash_time = time() - t ;
array a [1000000] _temporary_ ;
t = time() ;
do r = 1 to 100 ;
do _n_ = 1 to dim (a) ;
k = a[_n_] ;
end ;
end ;
arry_time = time() - t ;
put hash_time=z6.3 / arry_time=z6.3 ;
run ;
And the SAS log (which never lies) reports:
hash_time=15.105
arry_time=01.024
Remember, though, that the hash object's claim to fame is not sequential performance. It's quite obvious that traversing an AVL tree is a far more laborious task than scanning through an array (which requires nothing more than adding a fixed number to the address of the first array item). Though it's nice and convenient that SAS has included the enumeration operation with the hash object, its main strength lies is super-quick random access to data in memory by an arbitrary (i.e. simple or mixed type composite) key. Besides, the functionality of the hash object is way wider that that of an array, and matching even a fraction of it using arrays requires custom coding of the sort most SAS users are neither capable of nor should be asked to do - which is why the hash object was created in the first place.
On the other hand, for this particular type of task, a key-indexed temp array offers the advantage of algorithmic simplicity besides the speed. Note that by key-indexing the month numbers relative to a fixed point in the past (which never has to be changed, so using it makes no assumptions about the data), you automatically get all the month numbers in between as null-filled slots, and outputting them is no more complex than merely scanning through the array. If you stored the dates in a hash table instead, you'd still have to execute some kind of in-between logic. Of course, you could alternatively pre-fill the table with all months from min to max, then update it with the months actually available in the BY group and iterated through the table to get essentially what the array does. Surely, it has the advantage of not having to use the "big enough" type logic, but the latter is a small price to pay for a much simpler approach, especially since using 99999 in this case (heck, you could use 999999 if it should strike your fancy) is absolutely bulletproof.
3. Methinks I've already addressed that.
Kind regards
Paul D.
p.s. Thanks for the plug.
Guru, Classic and diligence at best. Thank you * 1e6 or never can thank enough for your time. While I have thoroughly comprehended the explanation, all I can possibly say is your detailed explanation is "sublime and elegant". I mean it. Simply Priceless!
Have a good night and take care!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.