Hello,
Can anybody help me to create the code which make the cumulative sum for the variable Marks used in below data set with LAG Keyword/Function.
Input:
data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;
1 jai 10 10
2 sai 20 30
3 ram 30 60
4 shyam 40 100
5 hari 50 150
Please provide the output only with using LAG function not Looking with Retain Keyword.
Regards,
Jaiganesh
Hi @jaiganesh
The Lag function returns the value from a previous input record. Lag1 (or just Lag) returns the value from the record preceding the current record, Lag2 from the one before that and so on. There is no limit on the number of Lags other than the available memory, so if you have 1000 records, you can use Lag1 - Lag999.
But you cannot use Lag on variables created in the data step, so you cannot accumulate values using Lag. Here is an example using Lag on your data to give the wanted result, so you can see what happens:
data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;
data want; set kk;
lag1marks = lag1(marks);
lag2marks = lag2(marks);
lag3marks = lag3(marks);
lag4marks = lag4(marks);
New_marks = sum(marks,lag1marks,lag2marks,lag3marks,lag4marks);
run;
This works with your 5 observations, because the code uses Lag1 - Lag4, but it cannot be made dynamic, at least not without a lot of macro coding. If there were more observations, it would give a rolling sum over the last 5 observarions, which might be useful in some cases, but to use Lag to make a total summation over a column in a data set is - politely expressed - not advisable.
Maxim 14: Use the Right Tool, in this case retain.
The function LAG is no appropriate here.
Ask your colleague to enlighten us and show us how it's done.
Note this was already asked by someone else and on the same day (and with equivalent replies).
Colleague or classmate?
If you show an attempt, happy to help you work through it.
The documentation has some examples that are pretty helpful if you work through them first:
@jaiganesh wrote:
Hello,
Can anybody help me to create the code which make the cumulative sum for the variable Marks used in below data set with LAG Keyword/Function.
Input:
data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;
output Should beid names marks New_marks1 jai 10 10
2 sai 20 30
3 ram 30 60
4 shyam 40 100
5 hari 50 150
Please provide the output only with using LAG function not Looking with Retain Keyword.
Regards,
Jaiganesh
@jaiganesh wrote:
Mentioned Link does not help me much with respect to certain with LAG Function cumulative addition.
Can you provide anything else
Because the lag() function is NOT the right tool for your task, period.
@jaiganesh wrote:
Mentioned Link does not help me much with respect to certain with LAG Function cumulative addition.
Can you provide anything else
Many people are saying that LAG is not the right tool for this purpose. What more do you need? Furthermore, there is no documentation that we can point to that says function X should not be used to perform Task A. People don't write documentation listing all the things that a specific function can't do.
Hi @jaiganesh,
I fully agree with all the others that the LAG function is not ideal for this task (to say the least). But you could use this (inefficient) code in order to apply it:
data _null_;
set kk end=last;
if _n_=1 then call execute('data want; set kk;');
call execute('new_marks=sum(marks,lag(new_marks));');
if last then call execute('run;');
run;
Edit: Or this:
data want;
set kk;
do _n_=0, 1;
if _n_ then output;
new_marks=sum(marks,lag(new_marks));
end;
run;
I'm really puzzled.
What is the functional difference between
data want;
set kk;
do _n_=0, 1;
if _n_ then output;
new_marks=sum(marks,lag(new_marks));
end;
run;
and
data want2;
set kk;
new_marks=sum(marks,lag(new_marks));
output;
new_marks=sum(marks,lag(new_marks));
run;
?
(the first works, the second not, although the summarization and output statements execute in the same order)
The crucial point is that each occurrence of the LAG function in the DATA step corresponds to a separate queue.
So, in the first code one queue operates twice: The idea is that the queue should always output the previous cumulative sum (to be added to the current value of MARKS in order to obtain the desired value of NEW_MARKS). This happens in the first iteration of the DO loop. The second iteration of the loop is required because the calculated value of NEW_MARKS must be fed into the queue, so that it will be output in the next call of the LAG function (in the next iteration of the DATA step). This cannot be accomplished during the calculation because obviously the value just being calculated is not available yet and hence a missing value is fed into the queue (since NEW_MARKS is not RETAINed). This missing value is then returned in the second iteration of the DO loop, leading (temporarily) to a "wrong" value of NEW_MARKS, namely NEW_MARKS=MARKS, but this doesn't matter because it happens after the OUTPUT statement.
The second code creates two separate queues. The first one does always the same in each iteration of the DATA step:
Hence, the value of NEW_MARKS which is output is always SUM(MARKS, .) = MARKS.
The second queue stores the value of NEW_MARKS created in the previous assignment statement (i.e. [with the data in dataset kk] 10, 20, 30, ... -- the values of MARKS, see above) and outputs the previous one, thus assigning temporary values 10, 30, 50, 70, ... to NEW_MARKS, but these aren't output to dataset WANT2 anyway, nor are they retained. So, the second assignment statement is useless.
That's the explanation. Two queues vs. one queue.
Thank you very much.
Maxim 13 in action.
That was really interesting, thank you so much! - I am sorry to have mislead @jaiganesh, but I never realized that lag is not coupled to the input buffer, but works on any assignment, so a construction like this works:
data _null_;
do a = 1 to 5;
b = lag(a);
c = lag(b);
d = lag(c);
e = lag(d);
put a= b= c= d= e=;
end;
run; a=1 b=. c=. d=. e=. a=2 b=1 c=. d=. e=. a=3 b=2 c=1 d=. e=. a=4 b=3 c=2 d=1 e=. a=5 b=4 c=3 d=2 e=1
Hi @jaiganesh
The Lag function returns the value from a previous input record. Lag1 (or just Lag) returns the value from the record preceding the current record, Lag2 from the one before that and so on. There is no limit on the number of Lags other than the available memory, so if you have 1000 records, you can use Lag1 - Lag999.
But you cannot use Lag on variables created in the data step, so you cannot accumulate values using Lag. Here is an example using Lag on your data to give the wanted result, so you can see what happens:
data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;
data want; set kk;
lag1marks = lag1(marks);
lag2marks = lag2(marks);
lag3marks = lag3(marks);
lag4marks = lag4(marks);
New_marks = sum(marks,lag1marks,lag2marks,lag3marks,lag4marks);
run;
This works with your 5 observations, because the code uses Lag1 - Lag4, but it cannot be made dynamic, at least not without a lot of macro coding. If there were more observations, it would give a rolling sum over the last 5 observarions, which might be useful in some cases, but to use Lag to make a total summation over a column in a data set is - politely expressed - not advisable.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.