Solved: Re: How to calulate a cumulative sum for each observation by id

ZSB · Posted 01-13-2021 05:26 AM

Hi,

My problem is about calculating the cumulative sum for each id and for each date taking into account a sliding period of 15 previous days. If the cumulative sum exceeds 10k, the variable top is incremented.

The treatment is done for June only.

Here is an example of the desired result :

id	app_date	price	cum	top
1	29-juin-20	4000	4 000	.
1	13-juin-20	5000	45 000	1
1	13-juin-20	6000	40 000	2
1	11-juin-20	7000	34 000	3
1	10-juin-20	8000	27 000	4
1	01-juin-20	9000	19 000	5
1	30-mai-20	10000	10 000	.

proc sort data = tab out= tab1;
by id descending app_date;

run;

data tab2;

set tab1;

%let annee=2020;

%let month=06;

by id;

retain last_date date_last_d CUM;

if first.id then do;

last_date =app_date;

date_last_dem = app_date;

CUM=0;

end;

if month(date_last_d) =&month. then do ;

diff= date_last_d -app_date;

CUM= price+ CUM;

end;

if diff>15 then do;

diff = .;

CUM =.;

last_date =app_date;

date_last_d = app_date;

end;

if last.id and CUM>10000 then top= top+1 ;

output;

last_date=app_date;

format last_date DDMMYY10.;

format date_last_d DDMMYY10.;

format CUM 14.2;

run;

I can do it for the first iteration but I cannot do it for all the lines.

Thank you,

PeterClemmensen · Posted 01-13-2021 10:30 AM

I interpret your correction like this: For ID 101666, you want all prices to be involved in the calculation of cum in the first iteration. However, for the second iteration, the first value should not be included, even though the date range is the same as the one above. Correct?

In that case, it is a small correction. Let me know if it works for you.

data have;
input Cnt Price ID App_date :ddmmyy10.;
format App_date ddmmyy10.;
datalines;
1  2265 534    30/05/2020
2  2330 4594   27/06/2020
3  1360 723    14/05/2020
4  1393 723    14/05/2020
5  2400 101666 12/06/2020
6  2411 101666 12/06/2020
7  2400 101666 11/06/2020
8  2400 101666 11/06/2020
9  2527 101666 10/06/2020
10 2536 101666 10/06/2020
11 2458 101666 04/06/2020
12 2758 1088   30/05/2020
13 4412 1056   13/06/2020
14 1870 1255   30/06/2020
15 4198 1255   14/05/2020
;

data want(drop = c k p dt);

   dcl hash h(ordered : "Y");
   h.definekey("c");
   h.definedata("c", "p", "dt");
   h.definedone();
   dcl hiter i("h");

   do c = 1 by 1 until (last.ID);
      set have(rename=(App_Date=dt Price=p));
      by ID notsorted;
      h.add();
   end;

   do k = 1 by 1 until (last.ID);
      set have;
      by ID notsorted;
      cum = 0;
      do while (i.next() = 0);
         if App_Date - 15 <= dt <= App_Date & k <= c then cum + p;
      end;
      if cum > 10000 then top + 1;
      else top = .;
      output;
   end;

   h.clear();
run;

Result:

Cnt  ID      Price  App_date    cum    top 
1    534     2265   30/05/2020  2265   . 
2    4594    2330   27/06/2020  2330   . 
3    723     1360   14/05/2020  2753   . 
4    723     1393   14/05/2020  1393   . 
5    101666  2400   12/06/2020  17132  1 
6    101666  2411   12/06/2020  14732  2 
7    101666  2400   11/06/2020  12321  3 
8    101666  2400   11/06/2020  9921   . 
9    101666  2527   10/06/2020  7521   . 
10   101666  2536   10/06/2020  4994   . 
11   101666  2458   04/06/2020  2458   . 
12   1088    2758   30/05/2020  2758   . 
13   1056    4412   13/06/2020  4412   . 
14   1255    1870   30/06/2020  1870   . 
15   1255    4198   14/05/2020  4198   .

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 01-13-2021 05:32 AM

Why does cum start at 4000 but jump to 45000?

The DATA to DATA Step Macro
Blog: SASnrd

andreas_lds · Posted 01-13-2021 06:34 AM

Thanks for sharing code and the expected output, but to fully understand the problem, i need to see the dataset tab1 (usable form preferred).

ZSB · Posted 01-13-2021 07:18 AM

There is a sample of data tab1 :

Cnt	Price	ID	App_date
1	2265	534	30/05/2020
2	2330	4594	27/06/2020
3	1360	723	14/05/2020
4	1393	723	14/05/2020
5	2400	101666	12/06/2020
6	2411	101666	12/06/2020
7	2400	101666	11/06/2020
8	2400	101666	11/06/2020
9	2527	101666	10/06/2020
10	2536	101666	10/06/2020
11	2458	101666	04/06/2020
12	2758	1088	30/05/2020
13	4412	1056	13/06/2020
14	1870	1255	30/06/2020
15	4198	1255	14/05/2020

For example, for the id 101666 the top should be equal to 3.

Iteration 1 : from cnt5 to cnt11--> CUM>10k and top=1

Iteration 2 : from cnt6 to cnt11--> CUM>10k and top=2

Iteration 3 : from cnt7 to cnt11--> CUM>10k and top=3

Note that for each iteration, i calculate the CUM over a period of 15 days.

Thanks,

PeterClemmensen · Posted 01-13-2021 08:01 AM

How about

data have;
input Cnt Price ID App_date :ddmmyy10.;
format App_date ddmmyy10.;
datalines;
1  2265 534    30/05/2020
2  2330 4594   27/06/2020
3  1360 723    14/05/2020
4  1393 723    14/05/2020
5  2400 101666 12/06/2020
6  2411 101666 12/06/2020
7  2400 101666 11/06/2020
8  2400 101666 11/06/2020
9  2527 101666 10/06/2020
10 2536 101666 10/06/2020
11 2458 101666 04/06/2020
12 2758 1088   30/05/2020
13 4412 1056   13/06/2020
14 1870 1255   30/06/2020
15 4198 1255   14/05/2020
;

data want(drop = c p dt);

   dcl hash h(ordered : "Y");
   h.definekey("c");
   h.definedata("p", "dt");
   h.definedone();
   dcl hiter i("h");

   do c = 1 by 1 until (last.ID);
      set have(rename=(App_Date=dt Price=p));
      by ID notsorted;
      h.add();
   end;

   do until (last.ID);
      set have;
      by ID notsorted;
      cum = 0;
      do while (i.next() = 0);
         if App_Date - 15 <= dt <= App_Date then cum + p;
      end;
      if cum > 10000 then top + 1;
      else top = .;
      output;
   end;

   h.clear();
run;

Result:

Cnt  ID      Price  App_date    cum    top 
1    534     2265   30/05/2020  2265   . 
2    4594    2330   27/06/2020  2330   . 
3    723     1360   14/05/2020  2753   . 
4    723     1393   14/05/2020  2753   . 
5    101666  2400   12/06/2020  17132  1 
6    101666  2411   12/06/2020  17132  2 
7    101666  2400   11/06/2020  12321  3 
8    101666  2400   11/06/2020  12321  4 
9    101666  2527   10/06/2020  7521   . 
10   101666  2536   10/06/2020  7521   . 
11   101666  2458   04/06/2020  2458   . 
12   1088    2758   30/05/2020  2758   . 
13   1056    4412   13/06/2020  4412   . 
14   1255    1870   30/06/2020  1870   . 
15   1255    4198   14/05/2020  4198   .

The DATA to DATA Step Macro
Blog: SASnrd

ZSB · Posted 01-13-2021 10:10 AM

thank you very much for your answer.

Indeed, in terms of treatment, that's exactly what it is expected.

But, for the id 101666, top must be stopped at 3.

The first iteration, CUM = 2400+2411+2400+2400+2527+2536+2458 = 17131>10000 --> top=1

The second iteration, CUM = 2411+2400+2400+2527+2536+2458 = 14732 >10000 --> top=2

The third iteration, CUM = 2400+2400+2527+2536+2458 = 12321>10000 --> top=3

The fourth one, CUM =+2400+2527+2536+2458 = 12321<10000

PeterClemmensen · Posted 01-13-2021 10:30 AM

I interpret your correction like this: For ID 101666, you want all prices to be involved in the calculation of cum in the first iteration. However, for the second iteration, the first value should not be included, even though the date range is the same as the one above. Correct?

In that case, it is a small correction. Let me know if it works for you.

data have;
input Cnt Price ID App_date :ddmmyy10.;
format App_date ddmmyy10.;
datalines;
1  2265 534    30/05/2020
2  2330 4594   27/06/2020
3  1360 723    14/05/2020
4  1393 723    14/05/2020
5  2400 101666 12/06/2020
6  2411 101666 12/06/2020
7  2400 101666 11/06/2020
8  2400 101666 11/06/2020
9  2527 101666 10/06/2020
10 2536 101666 10/06/2020
11 2458 101666 04/06/2020
12 2758 1088   30/05/2020
13 4412 1056   13/06/2020
14 1870 1255   30/06/2020
15 4198 1255   14/05/2020
;

data want(drop = c k p dt);

   dcl hash h(ordered : "Y");
   h.definekey("c");
   h.definedata("c", "p", "dt");
   h.definedone();
   dcl hiter i("h");

   do c = 1 by 1 until (last.ID);
      set have(rename=(App_Date=dt Price=p));
      by ID notsorted;
      h.add();
   end;

   do k = 1 by 1 until (last.ID);
      set have;
      by ID notsorted;
      cum = 0;
      do while (i.next() = 0);
         if App_Date - 15 <= dt <= App_Date & k <= c then cum + p;
      end;
      if cum > 10000 then top + 1;
      else top = .;
      output;
   end;

   h.clear();
run;

Result:

Cnt  ID      Price  App_date    cum    top 
1    534     2265   30/05/2020  2265   . 
2    4594    2330   27/06/2020  2330   . 
3    723     1360   14/05/2020  2753   . 
4    723     1393   14/05/2020  1393   . 
5    101666  2400   12/06/2020  17132  1 
6    101666  2411   12/06/2020  14732  2 
7    101666  2400   11/06/2020  12321  3 
8    101666  2400   11/06/2020  9921   . 
9    101666  2527   10/06/2020  7521   . 
10   101666  2536   10/06/2020  4994   . 
11   101666  2458   04/06/2020  2458   . 
12   1088    2758   30/05/2020  2758   . 
13   1056    4412   13/06/2020  4412   . 
14   1255    1870   30/06/2020  1870   . 
15   1255    4198   14/05/2020  4198   .

The DATA to DATA Step Macro
Blog: SASnrd

ZSB · Posted 01-13-2021 11:34 AM

Exactly!

Yes it works, thank you!

Have a good day!

ballardw · Posted 01-13-2021 11:05 AM

@ZSB wrote:

thank you very much for your answer.

Indeed, in terms of treatment, that's exactly what it is expected.

But, for the id 101666, top must be stopped at 3.

The first iteration, CUM = 2400+2411+2400+2400+2527+2536+2458 = 17131>10000 --> top=1

The second iteration, CUM = 2411+2400+2400+2527+2536+2458 = 14732 >10000 --> top=2

The third iteration, CUM = 2400+2400+2527+2536+2458 = 12321>10000 --> top=3

The fourth one, CUM =+2400+2527+2536+2458 = 12321<10000

I don't see anything in your original post about "top must be stopped at 3".

My problem is about calculating the cumulative sum for each id and for each date taking into account a sliding period of 15 previous days. If the cumulative sum exceeds 10k, the variable top is incremented.

The treatment is done for June only.

Do you have a bunch of arbitrary rules like that not shared yet? Such as possibly the "top" gets reset for each calendar month (or other period)? Not providing all the requirements up front complicates things because it may be the additional information requires a different approach.

Just for an example of something else not explicitly covered, you say " exceeds 10k," What would you have the result be if a single value added resulted in a sum of much more than 10K? Such as 100 + 100000? That crosses multiple 10K boundaries.

And do you have any negative values? If you do, would they be subtracted from the cumulative total or not?

ZSB · Posted 01-13-2021 11:55 AM

Thank you for answer!

I have already the solution for my problem.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away