Hi:
Just for comparison purposes, the DATA step program that would create a cumulative total by DEVELOPMENT_PK would look like this (it assumes your dataset is already sorted or ordered by DEVELOPMENT_PK):
[pre]
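data BilledUnitsCuSum;
  set BilledUnitsFixed;
  by development_pk;
  retain CuSum;                             /* keep CuSum across iterations */
  if first.development_pk then CuSum = 0;   /* reset at the start of each BY group */
  CuSum + total;                            /* sum statement: add TOTAL to CuSum */
run;

proc print data=BilledUnitsCuSum;
run;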
[/pre]
The program breaks down like this:
DATA statement: creates new dataset
SET statement: identifies the input dataset
BY statement: turns on by group processing using the variable listed
RETAIN statement: explicitly lists a variable whose value should be retained or "remembered" across each iteration of the data step
IF statement: tests whether the current observation is the first one for a development_pk BY group and, if true, resets the CuSum variable value to 0
SUM statement: accumulates CuSum by adding the value of TOTAL to it. (Note that the keyword "SUM" does not appear on this statement. This form of an assignment statement is known as a SUM statement in the documentation. Do not confuse this statement with the SUM function.)
RUN statement: ends the program by providing a step boundary
After this program runs, every observation in the dataset would now have a new variable called CuSum, which would be the cumulative total amount for that development_pk only.
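To see the result concretely, here is a small sketch with made-up data. The dataset names SampleUnits and SampleCuSum and the values in the DATALINES block are hypothetical, and I am assuming only the two variables DEVELOPMENT_PK and TOTAL are needed for the illustration. The PROC SORT step is there in case the data is not already in DEVELOPMENT_PK order:
[pre]
/* Hypothetical sample data, for illustration only */
data SampleUnits;
  input development_pk total;
  datalines;
102 7
101 5
101 10
102 3
101 20
;
run;

/* BY group processing needs the data ordered by the BY variable */
proc sort data=SampleUnits;
  by development_pk;
run;

data SampleCuSum;
  set SampleUnits;
  by development_pk;
  retain CuSum;
  if first.development_pk then CuSum = 0;   /* restart the total for each group */
  CuSum + total;
run;

proc print data=SampleCuSum noobs;
run;
[/pre]
With these values, CuSum should come out as 5, 15, 35 for development_pk 101 and then restart at 7, 10 for development_pk 102.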
One advantage of this approach is that it does not require a self-join because the RETAIN statement retains the value of the CuSum variable across iterations of the DATA step program. This means that until CuSum is reset, the value of TOTAL will keep getting added to CuSum. One feature of BY group processing is that, inside the DATA step, you can use the FIRST.byvar and LAST.byvar automatic variables to test whether the current observation is the first or last observation in a BY group.
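If you want to look at the FIRST. and LAST. values yourself, one way is to copy them into regular variables, because the automatic variables are not written to the output dataset. This is only a sketch: the dataset name ByFlags and the variable names IsFirst and IsLast are made up for the example.
[pre]
data ByFlags;
  set BilledUnitsFixed;
  by development_pk;
  IsFirst = first.development_pk;   /* 1 on the first observation of each group, else 0 */
  IsLast  = last.development_pk;    /* 1 on the last observation of each group, else 0 */
run;

proc print data=ByFlags;
run;
[/pre]
A group with only one observation would show IsFirst=1 and IsLast=1 on the same row.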
The program is explicitly setting CuSum to 0 at the first observation for every DEVELOPMENT_PK, because the IF statement is testing for the occurrence of FIRST.DEVELOPMENT_PK = 1. The automatic variable FIRST.DEVELOPMENT_PK will be equal to 1 at the first observation and equal to 0 on the other observations for the BY group. The shorthand or Boolean version of the IF statement:
if first.development_pk then....
is the same as coding
if first.development_pk = 1 then ....
If you don't have many observations, then either method would probably be OK. But if you have a LOT of observations, then you might want to benchmark for performance. Since a join is not involved, you might find the DATA step performs better for larger data sets.
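One simple way to benchmark is to turn on the FULLSTIMER system option, run each version, and compare the notes in the SAS log. This is just a sketch of the DATA step side. You would run your join version the same way and compare the real time and CPU time reported for each step.
[pre]
options fullstimer;     /* log shows real time, CPU time and memory for each step */

data BilledUnitsCuSum;
  set BilledUnitsFixed;
  by development_pk;
  retain CuSum;
  if first.development_pk then CuSum = 0;
  CuSum + total;
run;

options nofullstimer;   /* go back to the shorter log notes */
[/pre]
Timings can vary from run to run, so it is worth running each version more than once before drawing a conclusion.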
cynthia