DATA Step, Macro, Functions and more

Lag function help

Accepted Solution Solved
Reply
Contributor
Posts: 28
Accepted Solution

Lag function help

Hi

 

I have a table (simplified for this) which has 3 columns. An ID column, a date column and a score column. I'll call this table "Have"

 

e.g.

 

ID       date                 score

A    01/01/2015            20

A    02/01/2015            30

B    01/01/2015            20

B    02/01/2015            20 

B    02/01/2015            30

 

I want to add a column which identifies whether the scores have increased over time. I have tried a lag function like so;

 

data want;
set have;
length flag $20.;
if ID= lag(ID) and score> lag(score) then flag = 'Increased';
else if ID= lag(ID) and score< lag(score) then flag = 'Decreased';

else if ID= lag(ID) then flag = 'No Change'
else flag = '';
run; 

 

This seems to work on all cases except when the dates are the same, So I end up with the following (note the entry at the bottom should have increased)

 

ID       date                 score       flag

A    01/01/2015            20        

A    02/01/2015            30         Increased

B    01/01/2015            20

B    02/01/2015            20         No change

B    02/01/2015            30

 

Is there something I'm doing wrong? (first time using the lag function)


Accepted Solutions
Solution
‎03-01-2016 06:31 AM
Respected Advisor
Posts: 3,799

Re: Lag function help

[ Edited ]

You want to use BY ID; to process the data in groups.  You can do it with LAG but that is too hard.  Then you are just left with DIF functioin instead of LAG.  Set DIF to missing at first.id.

 

data score;
   input ID:$1. date:mmddyy. score;
   format date mmddyy.;
   cards;
A    01/01/2015            20
A    02/01/2015            30
B    01/01/2015            20
B    02/01/2015            20 
B    02/01/2015            30
C    01/01/2015            20
C    02/01/2015            30
C    01/01/2015            20
C    02/01/2015            20 
C    02/01/2015            30
;;;;
   run;
data score2;
   set score;
   by id;
   delta = sign(dif(score));
   if first.id then delta=.;
   run;
proc print;
   run;

Capture.PNG

 

View solution in original post


All Replies
Super Contributor
Posts: 490

Re: Lag function help

Your code have nothing to do with the date variable.
You forgot the semicolon in the third if statement.

Contributor
Posts: 28

Re: Lag function help

Posted in reply to mohamed_zaki

The missing semicolon is a typo when copying across. 

 

That's why I'm confused though, my query has nothing to do with the date variable so why hasn't the flag been populated for the bottom row? When I look at my full table it is the same every time the date is the same. I have sorted the data using ID and date but that's because I want to see changes over time for each person.

Solution
‎03-01-2016 06:31 AM
Respected Advisor
Posts: 3,799

Re: Lag function help

[ Edited ]

You want to use BY ID; to process the data in groups.  You can do it with LAG but that is too hard.  Then you are just left with DIF functioin instead of LAG.  Set DIF to missing at first.id.

 

data score;
   input ID:$1. date:mmddyy. score;
   format date mmddyy.;
   cards;
A    01/01/2015            20
A    02/01/2015            30
B    01/01/2015            20
B    02/01/2015            20 
B    02/01/2015            30
C    01/01/2015            20
C    02/01/2015            30
C    01/01/2015            20
C    02/01/2015            20 
C    02/01/2015            30
;;;;
   run;
data score2;
   set score;
   by id;
   delta = sign(dif(score));
   if first.id then delta=.;
   run;
proc print;
   run;

Capture.PNG

 

Contributor
Posts: 28

Re: Lag function help

Posted in reply to data_null__

Thanks, I did not know the dif function.

 

Still not sure why my original didn't work but this does so thanks.

Respected Advisor
Posts: 3,799

Re: Lag function help

@troopon your original program doesn't work for the same reason that @RW9 program does not work.  You are doing conditional lag which does not work in this situation.  LAG is not last OBS it is last PUSH.

Super User
Super User
Posts: 7,942

Re: Lag function help

You may also want to consider a slight modification to the code,

data have;
  input ID $ date $ score;
datalines;
A 01/01/2015 20
A 02/01/2015 30
B 01/01/2015 20
B 02/01/2015 20 
B 02/01/2015 30
;
run;

data want;
  set have;
  length flag $20.;
  if id=lag(id) then do;
    select;
      when(score > lag(score)) flag="Increased";
      when(score < lag(score)) flag="Decreased";
      otherwise flag="No change";
    end;
  end;
run;

The above illustrates the select clause which is useful for multiple if statements and when wrapped in the if statement clearly shows it only happens when id matches previous instance. 

Contributor
Posts: 28

Re: Lag function help

Curious question.

 

What is the difference between 

 

if id=lag(id) then do;
    select;
      when(score > lag(score)) flag="Increased";
      when(score < lag(score)) flag="Decreased";
      otherwise flag="No change";

and

 

 

if id=lag(id) then do;
      if(score > lag(score)) flag="Increased";
      else if(score < lag(score)) flag="Decreased";
      else flag="No change";
Super User
Super User
Posts: 7,942

Re: Lag function help

Well, there is not a lot of difference between the two logically as posted there - but then you have changed the statements when you posted this (note the then is missing from your if's here).  In your previous example:

if ID= lag(ID) and score> lag(score) then flag = 'Increased';
else if ID= lag(ID) and score< lag(score) then flag = 'Decreased';
else if ID= lag(ID) then flag = 'No Change' 
else flag = '';

The id=lag(id) part was duplicated several times.  That is one point.  The second point is on readability, so for two options then its not much difference, however when you have lots of conditions, it is far easier to read the select statement rather than a lot of if/if else statements, you can also follow down the page where the breaks.  

 

Respected Advisor
Posts: 3,799

Re: Lag function help

@RW9 your program doesn't work.  You are doing conditional lag which doesn't work here..

Capture.PNG

Super User
Super User
Posts: 7,942

Re: Lag function help

Posted in reply to data_null__

Yes, it worked for the small test data I had, so didn't look at it further.  I only really wanted to show the select statement.

Trusted Advisor
Posts: 1,117

Re: Lag function help

@RW9: Your code introduces conditional execution of LAG functions in a twofold way:

  1. The two LAG function calls in the SELECT statement are not executed in case that the IF condition is false.
  2. The LAG function call in the second WHEN statement is not executed if the previous WHEN condition is true.

The error pointed out by data_null__ is due to item 1. To see the impact of item 2, you can add an 11th observation to data_null__'s sample data with ID='C' and score=20. This observation will be incorrectly flagged "No change" (rather than "Decreased") due to item 2.

 

Trusted Advisor
Posts: 1,117

Re: Lag function help

[ Edited ]

@troopon:

  1. Given your sample data, the code and output shown in your initial post are not consistent (even after adding the missing semicolon in the second ELSE IF statement and correcting the case mismatch "Change" vs. "change"): Contrary to your statement, the last observation would receive FLAG='Increased'. The first IF condition is processed for every observation. Hence, for the last observation it evaluates to 'B'='B' and 30>20, which is true.

  2. As @data_null__ has pointed out already, your code doesn't work correctly, because
  • some of the LAG function calls are conditional (ELSE branches are not executed if a previous IF condition was met!). This means, depending on the input data, it can happen that they are not executed for a particular observation. AND
  • The intention is (apparently) that lag(variable) should give the value of variable of the previous observation. But this is not the case if the function was not executed for the previous observation.

With your sample data, however, this error is not visible in the output. In contrast, with data_null__'s sample data it would become apparent, as shown below.

 

output.png

(EDIT: The sort order of this test data set might not be what you want in practice, but it doesn't interfere with my explanations of the LAG function.)

 

The third observation of ID C is incorrectly flagged "No Change," because for the previous observation the first IF condition was met. As a consequence, the queue behind the second lag(score) was not touched. So, when the third obs. of ID C is processed, it still contains the value from the first obs. of ID C. The condition 'C'='C' and 20=20 (comparing values from the first and third observation of ID C) yields the incorrect flag.

 

I think, two very important points for understanding the LAG function are that

  1. for each place where the LAG function occurs in the code a separate queue is created
  2. if and only if the function call is executed, the LAG function returns the value from the queue and writes the current value of the variable in its argument to the queue.

Please note that the DIF function is closely related to the LAG function and therefore requires similar attention.

 

As mohamed_zaki has mentioned, variable DATE has no impact on variable FLAG as it is not involved in its definition.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 658 views
  • 1 like
  • 5 in conversation