Solved: Cumulative Sum without sorting

kashun · Posted 10-20-2020 12:46 PM

I am trying to find a way to create cumulative sum without changing positions of observations.

Have.

Observation	Name	Amount
1	John	10
2	Mark	20
3	Mark	10
4	John	40
5	John	30
6	Mark	20
7	John	10

Want

Observation	Name	Amount	Cumulative Sum
1	John	10	10
2	Mark	20	20
3	Mark	10	30
4	John	40	50
5	John	30	80
6	Mark	20	50
7	John	10	90

I tried BY Name NOTSORTED, but it did not work because the NOTSORTED option splits the names into five groups of consecutive observations instead of two.

I would be very grateful if someone could help me with this or point me to a link that can help.

Thanks.

FreelanceReinh · Posted 10-20-2020 01:11 PM

Hi @kashun,

Here's another hash object solution:

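data want;
if _n_=1 then do;
  /* SUMINC names the variable whose value is added to a key's running total */
  dcl hash h(suminc:'amount');
  h.definekey('name');        /* one hash entry per name */
  h.definedone();
end;
set have;                     /* read HAVE in its original order */
h.ref();                      /* add the name if it is new and add AMOUNT to its total */
h.sum(sum:Cumulative_Sum);    /* retrieve the current total for this name */
run;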

PaigeMiller · Posted 10-20-2020 12:53 PM

Sort by name, compute cumulative sums, un-sort back to the original order.

--
Paige Miller

novinosrin · Posted 10-20-2020 01:01 PM

Hi @kashun FWIW-



data have;
input Observation	Name $	Amount;
cards;
1	John	10
2	Mark	20
3	Mark	10
4	John	40
5	John	30
6	Mark	20
7	John	10
;

data want;
 set have;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("name") ;
   h.definedata ("Cummulative_Sum") ;
   h.definedone () ;
 end;
 if h.find() ne 0 then Cummulative_Sum=amount;
 else Cummulative_Sum=sum(Cummulative_Sum,amount);
 h.replace();
run;

Reeza · Posted 10-20-2020 01:03 PM

How big will that name list get? There are ways but they're all more work than just sorting/unsorting.

kashun · Posted 10-20-2020 01:45 PM

I am working with about 12+ GB of data with different names. Distinct count of names is more than 100 million.

novinosrin · Posted 10-20-2020 01:19 PM

Sir @FreelanceReinh That's what separates a genius from a mere mortal. Brilliant thinking!!! Kudos!

FreelanceReinh · Posted 10-20-2020 01:34 PM

@novinosrin: Thanks. 🙂 I just thought this might be one of the "very limited and special cases" (The book, p. 219) where the SUM method proves useful.

kashun · Posted 10-20-2020 01:56 PM

@FreelanceReinh, this is awesome. I might be wrong, but as far as I know hash objects hold their data temporarily in memory. Could there be another approach that does not use hash objects?

FreelanceReinh · Posted 10-20-2020 04:10 PM

@kashun wrote:

@FreelanceReinh, this is awesome. I might be wrong, but as far as I know hash objects hold their data temporarily in memory. Could there be another approach that does not use hash objects?

It's true that the hash object would occupy a considerable amount of memory. Another common approach is to create an index on dataset HAVE, but the computation of Cumulative_Sum BY Name (using the index) would result in a sorted dataset.
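For illustration, the index-based alternative mentioned above might look roughly like the sketch below. The output dataset name want_by_name is an assumption made for this example. Its result comes out in Name order, not in the original observation order, which is why it does not meet the requirement here.

```sas
/* Sketch only: create a simple index on NAME, then let BY processing use it. */
proc datasets library=work nolist;
  modify have;
  index create name;
quit;

data want_by_name;               /* hypothetical output name */
  set have;
  by name;                       /* HAVE is not sorted, so the index supplies the BY order */
  if first.name then Cumulative_Sum = 0;
  Cumulative_Sum + amount;       /* sum statement: the value is retained across observations */
run;
```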

I would try to implement the hash object approach, first on a smaller (but not too small) subset of HAVE. Then you could estimate the amount of memory needed for the full dataset. Depending on the length of variable Name it might be possible to reduce the memory footprint. (Note that in my code Name is also used as a data item, but this could be changed.) If it still exceeds the available RAM, maybe there's a possibility to split dataset HAVE or to take advantage of known characteristics of the dataset structure. For example, if a name is known to occur only up to a certain observation, its hash entry could be removed once that observation is reached.
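A minimal sketch of the subset test described above, assuming a subset of the first 1,000,000 observations (an arbitrary choice) and a hypothetical output name want_subset. It reports the number of hash items and the size of one item in the log. The product is only a rough lower bound, because it leaves out the hash object's own overhead, and the first observations may not be representative of how many distinct names the full dataset contains.

```sas
data want_subset;                       /* hypothetical output name */
  if _n_ = 1 then do;
    dcl hash h(suminc: 'amount');
    h.definekey('name');
    h.definedone();
  end;
  set have(obs=1000000) end=last;       /* subset size is an assumption */
  h.ref();
  h.sum(sum: Cumulative_Sum);
  if last then do;
    n_items    = h.num_items;           /* distinct names seen in the subset */
    item_bytes = h.item_size;           /* bytes per hash item */
    approx_mb  = n_items * item_bytes / 1024**2;
    put 'NOTE: ' n_items= item_bytes= approx_mb=;
  end;
  drop n_items item_bytes approx_mb;    /* keep the helper variables out of the output */
run;
```

Scaling approx_mb by the expected number of distinct names in the full dataset then gives a first estimate to compare with the available RAM.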

CurtisMackWSIPP · Posted 10-20-2020 02:07 PM

This would do it, but I suspect your case is more complicated somehow.

data want;
  set have;
  retain Cummulative_Sum;
  label Cummulative_Sum = "Cummulative Sum";
  Cummulative_Sum = sum(Cummulative_Sum,Amount);
run;

kashun · Posted 10-20-2020 02:24 PM

Yes. It looks like this does not take the name into account.

RichardDeVen · Posted 10-20-2020 07:07 PM

You can use several steps to compute the cumulative sum and then restore the original order.

Example:

data have_v / view=have_v;
  set have;
  rownum = _n_;
run;

proc sort data=have_v out=have_ord;
  by name rownum;
run;

data want_v / view=want_v;
  set have_ord;
  by name;
  if first.name then cusum = 0;
  cusum + amount;
run;

proc sort data=want_v out=want(drop=rownum);
  by rownum;  
run;

/* Clean-up: note that this also deletes the original table HAVE */
proc sql;
  drop table have, have_ord;
  drop view have_v, want_v;
quit;
