BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
troopon
Calcite | Level 5

Hi

 

I have a table (simplified for this) which has 3 columns. An ID column, a date column and a score column. I'll call this table "Have"

 

e.g.

 

ID       date                 score

A    01/01/2015            20

A    02/01/2015            30

B    01/01/2015            20

B    02/01/2015            20 

B    02/01/2015            30

 

I want to add a column which identifies whether the scores have increased over time. I have tried a lag function like so;

 

data want;
set have;
length flag $20.;
if ID= lag(ID) and score> lag(score) then flag = 'Increased';
else if ID= lag(ID) and score< lag(score) then flag = 'Decreased';

else if ID= lag(ID) then flag = 'No Change'
else flag = '';
run; 

 

This seems to work on all cases except when the dates are the same, So I end up with the following (note the entry at the bottom should have increased)

 

ID       date                 score       flag

A    01/01/2015            20        

A    02/01/2015            30         Increased

B    01/01/2015            20

B    02/01/2015            20         No change

B    02/01/2015            30

 

Is there something I'm doing wrong? (first time using the lag function)

1 ACCEPTED SOLUTION

Accepted Solutions
data_null__
Jade | Level 19

You want to use BY ID; to process the data in groups.  You can do it with LAG but that is too hard.  Then you are just left with DIF functioin instead of LAG.  Set DIF to missing at first.id.

 

data score;
   input ID:$1. date:mmddyy. score;
   format date mmddyy.;
   cards;
A    01/01/2015            20
A    02/01/2015            30
B    01/01/2015            20
B    02/01/2015            20 
B    02/01/2015            30
C    01/01/2015            20
C    02/01/2015            30
C    01/01/2015            20
C    02/01/2015            20 
C    02/01/2015            30
;;;;
   run;
data score2;
   set score;
   by id;
   delta = sign(dif(score));
   if first.id then delta=.;
   run;
proc print;
   run;

Capture.PNG

 

View solution in original post

12 REPLIES 12
mohamed_zaki
Barite | Level 11

Your code have nothing to do with the date variable.
You forgot the semicolon in the third if statement.

troopon
Calcite | Level 5

The missing semicolon is a typo when copying across. 

 

That's why I'm confused though, my query has nothing to do with the date variable so why hasn't the flag been populated for the bottom row? When I look at my full table it is the same every time the date is the same. I have sorted the data using ID and date but that's because I want to see changes over time for each person.

data_null__
Jade | Level 19

You want to use BY ID; to process the data in groups.  You can do it with LAG but that is too hard.  Then you are just left with DIF functioin instead of LAG.  Set DIF to missing at first.id.

 

data score;
   input ID:$1. date:mmddyy. score;
   format date mmddyy.;
   cards;
A    01/01/2015            20
A    02/01/2015            30
B    01/01/2015            20
B    02/01/2015            20 
B    02/01/2015            30
C    01/01/2015            20
C    02/01/2015            30
C    01/01/2015            20
C    02/01/2015            20 
C    02/01/2015            30
;;;;
   run;
data score2;
   set score;
   by id;
   delta = sign(dif(score));
   if first.id then delta=.;
   run;
proc print;
   run;

Capture.PNG

 

troopon
Calcite | Level 5

Thanks, I did not know the dif function.

 

Still not sure why my original didn't work but this does so thanks.

data_null__
Jade | Level 19

@troopon your original program doesn't work for the same reason that @RW9 program does not work.  You are doing conditional lag which does not work in this situation.  LAG is not last OBS it is last PUSH.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

You may also want to consider a slight modification to the code,

data have;
  input ID $ date $ score;
datalines;
A 01/01/2015 20
A 02/01/2015 30
B 01/01/2015 20
B 02/01/2015 20 
B 02/01/2015 30
;
run;

data want;
  set have;
  length flag $20.;
  if id=lag(id) then do;
    select;
      when(score > lag(score)) flag="Increased";
      when(score < lag(score)) flag="Decreased";
      otherwise flag="No change";
    end;
  end;
run;

The above illustrates the select clause which is useful for multiple if statements and when wrapped in the if statement clearly shows it only happens when id matches previous instance. 

troopon
Calcite | Level 5

Curious question.

 

What is the difference between 

 

if id=lag(id) then do;
    select;
      when(score > lag(score)) flag="Increased";
      when(score < lag(score)) flag="Decreased";
      otherwise flag="No change";

and

 

 

if id=lag(id) then do;
      if(score > lag(score)) flag="Increased";
      else if(score < lag(score)) flag="Decreased";
      else flag="No change";
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, there is not a lot of difference between the two logically as posted there - but then you have changed the statements when you posted this (note the then is missing from your if's here).  In your previous example:

if ID= lag(ID) and score> lag(score) then flag = 'Increased';
else if ID= lag(ID) and score< lag(score) then flag = 'Decreased';
else if ID= lag(ID) then flag = 'No Change' 
else flag = '';

The id=lag(id) part was duplicated several times.  That is one point.  The second point is on readability, so for two options then its not much difference, however when you have lots of conditions, it is far easier to read the select statement rather than a lot of if/if else statements, you can also follow down the page where the breaks.  

 

data_null__
Jade | Level 19

@RW9 your program doesn't work.  You are doing conditional lag which doesn't work here..

Capture.PNG

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Yes, it worked for the small test data I had, so didn't look at it further.  I only really wanted to show the select statement.

FreelanceReinh
Jade | Level 19

@RW9: Your code introduces conditional execution of LAG functions in a twofold way:

  1. The two LAG function calls in the SELECT statement are not executed in case that the IF condition is false.
  2. The LAG function call in the second WHEN statement is not executed if the previous WHEN condition is true.

The error pointed out by data_null__ is due to item 1. To see the impact of item 2, you can add an 11th observation to data_null__'s sample data with ID='C' and score=20. This observation will be incorrectly flagged "No change" (rather than "Decreased") due to item 2.

 

FreelanceReinh
Jade | Level 19

@troopon:

  1. Given your sample data, the code and output shown in your initial post are not consistent (even after adding the missing semicolon in the second ELSE IF statement and correcting the case mismatch "Change" vs. "change"): Contrary to your statement, the last observation would receive FLAG='Increased'. The first IF condition is processed for every observation. Hence, for the last observation it evaluates to 'B'='B' and 30>20, which is true.

  2. As @data_null__ has pointed out already, your code doesn't work correctly, because
  • some of the LAG function calls are conditional (ELSE branches are not executed if a previous IF condition was met!). This means, depending on the input data, it can happen that they are not executed for a particular observation. AND
  • The intention is (apparently) that lag(variable) should give the value of variable of the previous observation. But this is not the case if the function was not executed for the previous observation.

With your sample data, however, this error is not visible in the output. In contrast, with data_null__'s sample data it would become apparent, as shown below.

 

output.png

(EDIT: The sort order of this test data set might not be what you want in practice, but it doesn't interfere with my explanations of the LAG function.)

 

The third observation of ID C is incorrectly flagged "No Change," because for the previous observation the first IF condition was met. As a consequence, the queue behind the second lag(score) was not touched. So, when the third obs. of ID C is processed, it still contains the value from the first obs. of ID C. The condition 'C'='C' and 20=20 (comparing values from the first and third observation of ID C) yields the incorrect flag.

 

I think, two very important points for understanding the LAG function are that

  1. for each place where the LAG function occurs in the code a separate queue is created
  2. if and only if the function call is executed, the LAG function returns the value from the queue and writes the current value of the variable in its argument to the queue.

Please note that the DIF function is closely related to the LAG function and therefore requires similar attention.

 

As mohamed_zaki has mentioned, variable DATE has no impact on variable FLAG as it is not involved in its definition.

 

[Edit 2019-04-30: Reattached screenshot, which had been deleted by mistake.]

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 3744 views
  • 1 like
  • 5 in conversation