Re: Cumulative Sum Across Years and Months

StephenOverton · Posted 07-29-2009 10:04 AM

I'm trying to find the cumulative sum by development from month to month in this dataset:

development_pk year month total
272 2005 10 14
272 2005 11 18
272 2005 12 11
272 2006 1 8
272 2006 2 3
272 2006 3 22

StephenOverton · Posted 07-29-2009 10:21 AM

Forum kept truncating my post due to a less than sign, the 'le' in bold below is actually a less than or equal.

This is my proc sql so far:

create table BilledUnitsCuSum as
select t1.*, sum(t2.total) as CuSum
from BilledUnitsFixed as t1
join BilledUnitsFixed as t2 on t2.development_pk = t1.development_pk and t2.year = t1.year and t2.month le t1.month
group by t1.development_pk, t1.year, t1.month, t1.total
order by development_pk, year, month

But as you can see from this output, the cumulative sum resets after each year because the month resets back to 1:

development_pk year month total
272 2005 10 14 59
272 2005 11 18 77
272 2005 12 11 88
272 2006 1 8 8
272 2006 2 3 11
272 2006 3 22 33

How can I take advantage of using the _N_ variable (I think this is the observation/row number) to use in my join clause instead of the month??

Thanks!

StephenOverton · Posted 07-29-2009 11:17 AM

Disregard, I figured it out.

I created a new column in the previous data step called ID:

ID = _n_;

Then here is the updated SQL (changing the 'le' to an actual sign):

proc sql;
create table BilledUnitsCuSum as
select t1.development_pk, t1.year, t1.month, t1.total, sum(t2.total) as CuSum
from BilledUnitsFixed as t1
join BilledUnitsFixed as t2 on t2.development_pk = t1.development_pk and t2.ID le t1.ID
group by t1.development_pk, t1.year, t1.month, t1.total
order by development_pk, year, month
;
quit; Stupid less than sign messing up....

Message was edited by: SteveONCSU

Cynthia_sas · Posted 07-29-2009 12:00 PM

Hi:
Just for comparison purposes, the DATA step program that would create a cumulative total, by DEVELOPMENT_PK would look like this (assumes your dataset is sorted or ordered by DEVELOPMENT_PK):
[pre]
data BilledUnitsCuSum ;
set BilledUnitsFixed ;
by development_pk;
retain CuSum;
if first.development_pk then CuSum= 0;
CuSum + total;
run;

proc print data=BilledUnitsCuSum;
run;
[/pre]

The program breaks down like this:
DATA statement: creates new dataset
SET statement: identifies the input dataset
BY statement: turns on by group processing using the variable listed
RETAIN statement: explicitly lists a variable whose value should be retained or "remembered" across each iteration of the data step
IF statement: tests whether an observation contains the first development_pk for a BY group and if true, resets the CuSum variable value to 0
SUM statement: accumulates CuSum by adding the value of TOTAL to it. (Note that the keyword "SUM" does not appear on this statement. This form of an assignment statement is known as a SUM statement in the documentation. Do not confuse this statement with the SUM function.)
RUN statement: ends the program by providing a step boundary

After this program runs, every observation in the dataset would now have a new variable called CuSum, which would be the cumulative total amount for that development_pk only.

One advantage of this approach is that it does not require a self-join because the RETAIN statement retains the value of the CuSum variable across iterations of the DATA step program. This means that until CuSum is reset, the value of TOTAL will keep getting added to CuSum. One feature of BY group processing is that, inside the DATA step, you can use FIRST.byvar and LAST.byvar automatic variables to test whether the input buffer holds the first or last observation in a BY group.

The program is explicitly setting CuSum to 0 at the first observation for every DEVELOPMENT_PK, because the IF statement is testing for the occurence of FIRST.DEVELOOPMENT_PK = 1. The automatic variable FIRST.DEVELOPMENT_PK will be equal to 1 at the first observation and equal to 0 on the other observations for the by group. The shorthand or Boolean version of the IF statement:
if first.development_pk then....
is the same as coding
if first.development_pk = 1 then ....

If you don't have many observations, then either method would probably be OK. But, if you have a LOT of observations, then you might want to benchmark for performance. Since a join is not involved you might find the DATA step performs better for larger data sets.

cynthia

StephenOverton · Posted 07-29-2009 12:02 PM

Thanks for the improvement, I think this will be much more scalable in the future.

Cumulative Sum Across Years and Months