I am trying to find a way to create cumulative sum without changing positions of observations.
Have.
Observation | Name | Amount |
1 | John | 10 |
2 | Mark | 20 |
3 | Mark | 10 |
4 | John | 40 |
5 | John | 30 |
6 | Mark | 20 |
7 | John | 10 |
Want
Observation | Name | Amount | Cummulative Sum |
1 | John | 10 | 10 |
2 | Mark | 20 | 20 |
3 | Mark | 10 | 30 |
4 | John | 40 | 50 |
5 | John | 30 | 80 |
6 | Mark | 20 | 50 |
7 | John | 10 | 90 |
I tried by Name notsorted and did not work since the notsorted option group name into 5 groups.
Will be very grateful if someone can assist me on how to go about this or refer me to a link that can help.
Thanks.
Hi @kashun,
Here's another hash object solution:
data want;
if _n_=1 then do;
dcl hash h(suminc:'amount');
h.definekey('name');
h.definedone();
end;
set have;
h.ref();
h.sum(sum:Cumulative_Sum);
run;
Sort by name, compute cumulative sums, un-sort back to the original order.
Hi @kashun FWIW-
data have;
input Observation Name $ Amount;
cards;
1 John 10
2 Mark 20
3 Mark 10
4 John 40
5 John 30
6 Mark 20
7 John 10
;
data want;
set have;
if _n_=1 then do;
dcl hash H () ;
h.definekey ("name") ;
h.definedata ("Cummulative_Sum") ;
h.definedone () ;
end;
if h.find() ne 0 then Cummulative_Sum=amount;
else Cummulative_Sum=sum(Cummulative_Sum,amount);
h.replace();
run;
I am working with about 12+ GB of data with different names. Distinct count of names is more than 100 million.
Hi @kashun,
Here's another hash object solution:
data want;
if _n_=1 then do;
dcl hash h(suminc:'amount');
h.definekey('name');
h.definedone();
end;
set have;
h.ref();
h.sum(sum:Cumulative_Sum);
run;
Sir @FreelanceReinh That's what separated a genius from mere mortal. Brilliant thinking!!! Kudos!
@novinosrin: Thanks. 🙂 I just thought this might be one of the "very limited and special cases" (The book, p. 219) where the SUM method proves useful.
@ FreelanceReinhard. This is awesome. I might be wrong but I know hash objects stores a temporary data into memory. Could there be another approach without using hash objects?
@kashun wrote:
@ FreelanceReinhard. This is awesome. I might be wrong but I know hash objects stores a temporary data into memory. Could there be another approach without using hash objects?
It's true that the hash object would occupy a considerable amount of memory. Another common approach is to create an index on dataset HAVE, but the computation of Cumulative_Sum BY Name (using the index) would result in a sorted dataset.
I would try to implement the hash object approach, first on a smaller (but not too small) subset of HAVE. Then you could estimate the amount of memory needed for the full dataset. Depending on the length of variable Name it might be possible to reduce the memory footprint. (Note that in my code Name is also used as a data item, but this could be changed.) If it still exceeds the available RAM, maybe there's a possibility to split dataset HAVE or to take advantage of known characteristics of the dataset structure. For example, if a name is known to occur only up to a certain observation, its hash entry could be removed once that observation is reached.
This would do it, but I suspect your case is more complicated some how.
data want;
set have;
retain Cummulative_Sum;
label Cummulative_Sum = "Cummulative Sum";
Cummulative_Sum = sum(Cummulative_Sum,Amount);
run;
Yes. It looks like this is not taking into account name
You can use several step operations to compute the cumulative sum and restore the original order
Example:
data have_v / view=have_v; set have; rownum = _n_; run; proc sort data=have_v out=have_ord; by name rownum; run; data want_v / view=want_v; set have_ord; by name; if first.name then cusum = 0; cusum + amount; run; proc sort data=want_v out=want(drop=rownum); by rownum; run;
proc sql;
drop table have, have_ord;
drop view have_v, want_v;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.