Solved: Re: LAG Function

jaiganesh · Posted 02-27-2019 10:42 AM

Hello,

Can anybody help me to create the code which make the cumulative sum for the variable Marks used in below data set with LAG Keyword/Function.

Input:

data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;

output Should be

id names marks New_marks

1 jai 10 10
2 sai 20 30
3 ram 30 60
4 shyam 40 100
5 hari 50 150

Please provide the output only with using LAG function not Looking with Retain Keyword.

Regards,

Jaiganesh

ErikLund_Jensen · Posted 02-27-2019 03:31 PM

Hi @jaiganesh

The Lag function returns the value from a previous input record. Lag1 (or just Lag) returns the value from the record preceding the current record, Lag2 from the one before that and so on. There is no limit on the number of Lags other than the available memory, so if you have 1000 records, you can use Lag1 - Lag999.

But you cannot use Lag on variables created in the data step, so you cannot accumulate values using Lag. Here is an example using Lag on your data to give the wanted result, so you can see what happens:

data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;

data want; set kk; 
	lag1marks = lag1(marks);
	lag2marks = lag2(marks);
	lag3marks = lag3(marks);
	lag4marks = lag4(marks);
	New_marks = sum(marks,lag1marks,lag2marks,lag3marks,lag4marks);
run;

This works with your 5 observations, because the code uses Lag1 - Lag4, but it cannot be made dynamic, at least not without a lot of macro coding. If there were more observations, it would give a rolling sum over the last 5 observarions, which might be useful in some cases, but to use Lag to make a total summation over a column in a data set is - politely expressed - not advisable.

View solution in original post

andreas_lds · Posted 02-27-2019 10:45 AM

And why don't you want to use retain? Have you read the examples in the online docs? This is such an easy task, it should be solvable with some research.

jaiganesh · Posted 02-27-2019 11:51 AM

I've been using Retain Keyword, one of my colleague has raise the concern of LAG function functionality and it's application in programming, Let me know if you need any other information.

Kurt_Bremser · Posted 02-27-2019 11:59 AM

Maxim 14: Use the Right Tool, in this case retain.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 02-27-2019 04:08 PM

The function LAG is no appropriate here.

Ask your colleague to enlighten us and show us how it's done.

Note this was already asked by someone else and on the same day (and with equivalent replies).

https://communities.sas.com/t5/New-SAS-User/How-to-calculate-cumulative-sum-using-Lag-function/td-p/...

Colleague or classmate?

High-Performance SAS Coding - Third Edition

Reeza · Posted 02-27-2019 01:44 PM

If you show an attempt, happy to help you work through it.

The documentation has some examples that are pretty helpful if you work through them first:

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=lefunctionsref&docsetTarg...

@jaiganesh wrote:

Hello,

Can anybody help me to create the code which make the cumulative sum for the variable Marks used in below data set with LAG Keyword/Function.

Input:

data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;

output Should be

id names marks New_marks
1 jai 10 10
2 sai 20 30
3 ram 30 60
4 shyam 40 100
5 hari 50 150

Please provide the output only with using LAG function not Looking with Retain Keyword.

Regards,

Jaiganesh

jaiganesh · Posted 02-27-2019 03:01 PM

Mentioned Link does not help me much with respect to certain with LAG Function cumulative addition.

Can you provide anything else

Kurt_Bremser · Posted 02-27-2019 03:03 PM

@jaiganesh wrote:
Mentioned Link does not help me much with respect to certain with LAG Function cumulative addition.

Can you provide anything else

Because the lag() function is NOT the right tool for your task, period.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PaigeMiller · Posted 02-27-2019 03:13 PM

@jaiganesh wrote:
Mentioned Link does not help me much with respect to certain with LAG Function cumulative addition.

Can you provide anything else

Many people are saying that LAG is not the right tool for this purpose. What more do you need? Furthermore, there is no documentation that we can point to that says function X should not be used to perform Task A. People don't write documentation listing all the things that a specific function can't do.

--
Paige Miller

FreelanceReinh · Posted 02-27-2019 05:14 PM

Hi @jaiganesh,

I fully agree with all the others that the LAG function is not ideal for this task (to say the least). But you could use this (inefficient) code in order to apply it:

data _null_;
set kk end=last;
if _n_=1 then call execute('data want; set kk;');
call execute('new_marks=sum(marks,lag(new_marks));');
if last then call execute('run;');
run;

Edit: Or this:

data want;
set kk;
do _n_=0, 1;
  if _n_ then output;
  new_marks=sum(marks,lag(new_marks));
end;
run;

Kurt_Bremser · Posted 03-01-2019 07:39 AM

I'm really puzzled.

What is the functional difference between

data want;
set kk;
do _n_=0, 1;
  if _n_ then output;
  new_marks=sum(marks,lag(new_marks));
end;
run;

and

data want2;
set kk;
new_marks=sum(marks,lag(new_marks));
output;
new_marks=sum(marks,lag(new_marks));
run;

?

(the first works, the second not, although the summarization and output statements execute in the same order)

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

FreelanceReinh · Posted 03-01-2019 08:53 AM

The crucial point is that each occurrence of the LAG function in the DATA step corresponds to a separate queue.

So, in the first code one queue operates twice: The idea is that the queue should always output the previous cumulative sum (to be added to the current value of MARKS in order to obtain the desired value of NEW_MARKS). This happens in the first iteration of the DO loop. The second iteration of the loop is required because the calculated value of NEW_MARKS must be fed into the queue, so that it will be output in the next call of the LAG function (in the next iteration of the DATA step). This cannot be accomplished during the calculation because obviously the value just being calculated is not available yet and hence a missing value is fed into the queue (since NEW_MARKS is not RETAINed). This missing value is then returned in the second iteration of the DO loop, leading (temporarily) to a "wrong" value of NEW_MARKS, namely NEW_MARKS=MARKS, but this doesn't matter because it happens after the OUTPUT statement.

The second code creates two separate queues. The first one does always the same in each iteration of the DATA step:

It outputs a missing value (in the first iteration of the DATA step this is the usual behavior in the first call of any LAG function, in subsequent iterations it's because of item 2 below).
It stores a missing value because the unretained variable NEW_MARKS has been initialized to missing and not yet assigned a value at the time when the corresponding LAG function is called.

Hence, the value of NEW_MARKS which is output is always SUM(MARKS, .) = MARKS.

The second queue stores the value of NEW_MARKS created in the previous assignment statement (i.e. [with the data in dataset kk] 10, 20, 30, ... -- the values of MARKS, see above) and outputs the previous one, thus assigning temporary values 10, 30, 50, 70, ... to NEW_MARKS, but these aren't output to dataset WANT2 anyway, nor are they retained. So, the second assignment statement is useless.

Kurt_Bremser · Posted 03-01-2019 10:01 AM

That's the explanation. Two queues vs. one queue.

Thank you very much.

Maxim 13 in action.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ErikLund_Jensen · Posted 03-01-2019 01:03 PM

@FreelanceReinh

That was really interesting, thank you so much! - I am sorry to have mislead @jaiganesh, but I never realized that lag is not coupled to the input buffer, but works on any assignment, so a construction like this works:

data _null_;
    do a = 1 to 5;
        b = lag(a);
        c = lag(b);
        d = lag(c);
        e = lag(d);
        put a= b= c= d= e=;
    end;
run;

a=1 b=. c=. d=. e=.
a=2 b=1 c=. d=. e=.
a=3 b=2 c=1 d=. e=.
a=4 b=3 c=2 d=1 e=.
a=5 b=4 c=3 d=2 e=1

ErikLund_Jensen · Posted 02-27-2019 03:31 PM

Hi @jaiganesh

The Lag function returns the value from a previous input record. Lag1 (or just Lag) returns the value from the record preceding the current record, Lag2 from the one before that and so on. There is no limit on the number of Lags other than the available memory, so if you have 1000 records, you can use Lag1 - Lag999.

But you cannot use Lag on variables created in the data step, so you cannot accumulate values using Lag. Here is an example using Lag on your data to give the wanted result, so you can see what happens:

data kk;
input id names$ marks;
cards;
1 jai 10
2 sai 20
3 ram 30
4 shyam 40
5 hari 50
;
run;

data want; set kk; 
	lag1marks = lag1(marks);
	lag2marks = lag2(marks);
	lag3marks = lag3(marks);
	lag4marks = lag4(marks);
	New_marks = sum(marks,lag1marks,lag2marks,lag3marks,lag4marks);
run;

This works with your 5 observations, because the code uses Lag1 - Lag4, but it cannot be made dynamic, at least not without a lot of macro coding. If there were more observations, it would give a rolling sum over the last 5 observarions, which might be useful in some cases, but to use Lag to make a total summation over a column in a data set is - politely expressed - not advisable.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away