BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jaiganesh
Obsidian | Level 7

Hello,

 

Can anybody help me to create the code which make the cumulative sum for the variable Marks used in below data set with LAG Keyword/Function.

 

 

Input:

data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;

 

 
output  Should be 
 
id names marks New_marks

1    jai        10            10
2    sai       20            30
3    ram     30            60
4   shyam 40            100
5   hari      50            150

 

Please provide the output only with using LAG function not Looking with  Retain Keyword.

 

 

Regards,

Jaiganesh

1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @jaiganesh 

 

The Lag function returns the value from a previous input record. Lag1 (or just Lag) returns the value from the record preceding the current record, Lag2 from the one before that and so on. There is no limit on the number of Lags other than the available memory, so if you have 1000 records, you can use Lag1 - Lag999.

 

But you cannot use Lag on variables created in the data step, so you cannot accumulate values using Lag. Here is an example using Lag on your data to give the wanted result, so you can see what happens:

 

 

data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;

data want; set kk; 
	lag1marks = lag1(marks);
	lag2marks = lag2(marks);
	lag3marks = lag3(marks);
	lag4marks = lag4(marks);
	New_marks = sum(marks,lag1marks,lag2marks,lag3marks,lag4marks);
run;

lag.gif

 

 

 

 

 

 

 

 

 

This works with your 5 observations, because the code uses Lag1 - Lag4, but it cannot be made dynamic, at least not without a lot of macro coding. If there were more observations, it would give a rolling sum over the last 5 observarions, which might be useful in some cases, but to use Lag to make a total summation over a column in a data set is - politely expressed - not advisable. 

 

 

View solution in original post

15 REPLIES 15
andreas_lds
Jade | Level 19
And why don't you want to use retain? Have you read the examples in the online docs? This is such an easy task, it should be solvable with some research.
jaiganesh
Obsidian | Level 7
I've been using Retain Keyword, one of my colleague has raise the concern of LAG function functionality and it's application in programming, Let me know if you need any other information.
ChrisNZ
Tourmaline | Level 20

The function LAG is no appropriate here.

Ask your colleague to enlighten us and show us how it's done. 

 

Note this was already asked by someone else and on the same day (and with equivalent replies).

https://communities.sas.com/t5/New-SAS-User/How-to-calculate-cumulative-sum-using-Lag-function/td-p/...

Colleague or classmate?

 

 

Reeza
Super User

If you show an attempt, happy to help you work through it. 

 

The documentation has some examples that are pretty helpful if you work through them first:

 

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=lefunctionsref&docsetTarg...

 


@jaiganesh wrote:

Hello,

 

Can anybody help me to create the code which make the cumulative sum for the variable Marks used in below data set with LAG Keyword/Function.

 

 

Input:

data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;

 

 
output  Should be 
 
id names marks New_marks

1    jai        10            10
2    sai       20            30
3    ram     30            60
4   shyam 40            100
5   hari      50            150

 

Please provide the output only with using LAG function not Looking with  Retain Keyword.

 

 

Regards,

Jaiganesh


 

jaiganesh
Obsidian | Level 7
Mentioned Link does not help me much with respect to certain with LAG Function cumulative addition.

Can you provide anything else
Kurt_Bremser
Super User

@jaiganesh wrote:
Mentioned Link does not help me much with respect to certain with LAG Function cumulative addition.

Can you provide anything else

Because the lag() function is NOT the right tool for your task, period.

PaigeMiller
Diamond | Level 26

@jaiganesh wrote:
Mentioned Link does not help me much with respect to certain with LAG Function cumulative addition.

Can you provide anything else

Many people are saying that LAG is not the right tool for this purpose. What more do you need? Furthermore, there is no documentation that we can point to that says function X should not be used to perform Task A. People don't write documentation listing all the things that a specific function can't do.

--
Paige Miller
FreelanceReinh
Jade | Level 19

Hi @jaiganesh,

 

I fully agree with all the others that the LAG function is not ideal for this task (to say the least). But you could use this (inefficient) code in order to apply it:

data _null_;
set kk end=last;
if _n_=1 then call execute('data want; set kk;');
call execute('new_marks=sum(marks,lag(new_marks));');
if last then call execute('run;');
run;

Edit: Or this:

data want;
set kk;
do _n_=0, 1;
  if _n_ then output;
  new_marks=sum(marks,lag(new_marks));
end;
run;

 

Kurt_Bremser
Super User

 

I'm really puzzled.

What is the functional difference between

data want;
set kk;
do _n_=0, 1;
  if _n_ then output;
  new_marks=sum(marks,lag(new_marks));
end;
run;

and

data want2;
set kk;
new_marks=sum(marks,lag(new_marks));
output;
new_marks=sum(marks,lag(new_marks));
run;

?

(the first works, the second not, although the summarization and output statements execute in the same order)

FreelanceReinh
Jade | Level 19

The crucial point is that each occurrence of the LAG function in the DATA step corresponds to a separate queue.

 

So, in the first code one queue operates twice: The idea is that the queue should always output the previous cumulative sum (to be added to the current value of MARKS in order to obtain the desired value of NEW_MARKS). This happens in the first iteration of the DO loop. The second iteration of the loop is required because the calculated value of NEW_MARKS must be fed into the queue, so that it will be output in the next call of the LAG function (in the next iteration of the DATA step). This cannot be accomplished during the calculation because obviously the value just being calculated is not available yet and hence a missing value is fed into the queue (since NEW_MARKS is not RETAINed). This missing value is then returned in the second iteration of the DO loop, leading (temporarily) to a "wrong" value of NEW_MARKS, namely NEW_MARKS=MARKS, but this doesn't matter because it happens after the OUTPUT statement.

 

The second code creates two separate queues. The first one does always the same in each iteration of the DATA step:

  1. It outputs a missing value (in the first iteration of the DATA step this is the usual behavior in the first call of any LAG function, in subsequent iterations it's because of item 2 below).
  2. It stores a missing value because the unretained variable NEW_MARKS has been initialized to missing and not yet assigned a value at the time when the corresponding LAG function is called.

Hence, the value of NEW_MARKS which is output is always SUM(MARKS, .) = MARKS.

 

The second queue stores the value of NEW_MARKS created in the previous assignment statement (i.e. [with the data in dataset kk] 10, 20, 30, ... -- the values of MARKS, see above) and outputs the previous one, thus assigning temporary values 10, 30, 50, 70, ... to NEW_MARKS, but these aren't output to dataset WANT2 anyway, nor are they retained. So, the second assignment statement is useless.

ErikLund_Jensen
Rhodochrosite | Level 12

@FreelanceReinh 

That was really interesting, thank you so much! - I am sorry to have mislead @jaiganesh, but I never realized that lag is not coupled to the input buffer, but works on any assignment, so a construction like this works:

 

data _null_;
do a = 1 to 5;
b = lag(a);
c = lag(b);
d = lag(c);
e = lag(d);
put a= b= c= d= e=;
end;
run; a=1 b=. c=. d=. e=. a=2 b=1 c=. d=. e=. a=3 b=2 c=1 d=. e=. a=4 b=3 c=2 d=1 e=. a=5 b=4 c=3 d=2 e=1
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @jaiganesh 

 

The Lag function returns the value from a previous input record. Lag1 (or just Lag) returns the value from the record preceding the current record, Lag2 from the one before that and so on. There is no limit on the number of Lags other than the available memory, so if you have 1000 records, you can use Lag1 - Lag999.

 

But you cannot use Lag on variables created in the data step, so you cannot accumulate values using Lag. Here is an example using Lag on your data to give the wanted result, so you can see what happens:

 

 

data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;

data want; set kk; 
	lag1marks = lag1(marks);
	lag2marks = lag2(marks);
	lag3marks = lag3(marks);
	lag4marks = lag4(marks);
	New_marks = sum(marks,lag1marks,lag2marks,lag3marks,lag4marks);
run;

lag.gif

 

 

 

 

 

 

 

 

 

This works with your 5 observations, because the code uses Lag1 - Lag4, but it cannot be made dynamic, at least not without a lot of macro coding. If there were more observations, it would give a rolling sum over the last 5 observarions, which might be useful in some cases, but to use Lag to make a total summation over a column in a data set is - politely expressed - not advisable. 

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 4747 views
  • 7 likes
  • 8 in conversation