- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi
I have a table (simplified for this) which has 3 columns. An ID column, a date column and a score column. I'll call this table "Have"
e.g.
ID date score
A 01/01/2015 20
A 02/01/2015 30
B 01/01/2015 20
B 02/01/2015 20
B 02/01/2015 30
I want to add a column which identifies whether the scores have increased over time. I have tried a lag function like so;
data want;
set have;
length flag $20.;
if ID= lag(ID) and score> lag(score) then flag = 'Increased';
else if ID= lag(ID) and score< lag(score) then flag = 'Decreased';
else if ID= lag(ID) then flag = 'No Change'
else flag = '';
run;
This seems to work on all cases except when the dates are the same, So I end up with the following (note the entry at the bottom should have increased)
ID date score flag
A 01/01/2015 20
A 02/01/2015 30 Increased
B 01/01/2015 20
B 02/01/2015 20 No change
B 02/01/2015 30
Is there something I'm doing wrong? (first time using the lag function)
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You want to use BY ID; to process the data in groups. You can do it with LAG but that is too hard. Then you are just left with DIF functioin instead of LAG. Set DIF to missing at first.id.
data score;
input ID:$1. date:mmddyy. score;
format date mmddyy.;
cards;
A 01/01/2015 20
A 02/01/2015 30
B 01/01/2015 20
B 02/01/2015 20
B 02/01/2015 30
C 01/01/2015 20
C 02/01/2015 30
C 01/01/2015 20
C 02/01/2015 20
C 02/01/2015 30
;;;;
run;
data score2;
set score;
by id;
delta = sign(dif(score));
if first.id then delta=.;
run;
proc print;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your code have nothing to do with the date variable.
You forgot the semicolon in the third if statement.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The missing semicolon is a typo when copying across.
That's why I'm confused though, my query has nothing to do with the date variable so why hasn't the flag been populated for the bottom row? When I look at my full table it is the same every time the date is the same. I have sorted the data using ID and date but that's because I want to see changes over time for each person.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You want to use BY ID; to process the data in groups. You can do it with LAG but that is too hard. Then you are just left with DIF functioin instead of LAG. Set DIF to missing at first.id.
data score;
input ID:$1. date:mmddyy. score;
format date mmddyy.;
cards;
A 01/01/2015 20
A 02/01/2015 30
B 01/01/2015 20
B 02/01/2015 20
B 02/01/2015 30
C 01/01/2015 20
C 02/01/2015 30
C 01/01/2015 20
C 02/01/2015 20
C 02/01/2015 30
;;;;
run;
data score2;
set score;
by id;
delta = sign(dif(score));
if first.id then delta=.;
run;
proc print;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, I did not know the dif function.
Still not sure why my original didn't work but this does so thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You may also want to consider a slight modification to the code,
data have; input ID $ date $ score; datalines; A 01/01/2015 20 A 02/01/2015 30 B 01/01/2015 20 B 02/01/2015 20 B 02/01/2015 30 ; run; data want; set have; length flag $20.; if id=lag(id) then do; select; when(score > lag(score)) flag="Increased"; when(score < lag(score)) flag="Decreased"; otherwise flag="No change"; end; end; run;
The above illustrates the select clause which is useful for multiple if statements and when wrapped in the if statement clearly shows it only happens when id matches previous instance.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Curious question.
What is the difference between
if id=lag(id) then do; select; when(score > lag(score)) flag="Increased"; when(score < lag(score)) flag="Decreased"; otherwise flag="No change";
and
if id=lag(id) then do; if(score > lag(score)) flag="Increased"; else if(score < lag(score)) flag="Decreased"; else flag="No change";
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Well, there is not a lot of difference between the two logically as posted there - but then you have changed the statements when you posted this (note the then is missing from your if's here). In your previous example:
if ID= lag(ID) and score> lag(score) then flag = 'Increased'; else if ID= lag(ID) and score< lag(score) then flag = 'Decreased'; else if ID= lag(ID) then flag = 'No Change' else flag = '';
The id=lag(id) part was duplicated several times. That is one point. The second point is on readability, so for two options then its not much difference, however when you have lots of conditions, it is far easier to read the select statement rather than a lot of if/if else statements, you can also follow down the page where the breaks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@RW9 your program doesn't work. You are doing conditional lag which doesn't work here..
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes, it worked for the small test data I had, so didn't look at it further. I only really wanted to show the select statement.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@RW9: Your code introduces conditional execution of LAG functions in a twofold way:
- The two LAG function calls in the SELECT statement are not executed in case that the IF condition is false.
- The LAG function call in the second WHEN statement is not executed if the previous WHEN condition is true.
The error pointed out by data_null__ is due to item 1. To see the impact of item 2, you can add an 11th observation to data_null__'s sample data with ID='C' and score=20. This observation will be incorrectly flagged "No change" (rather than "Decreased") due to item 2.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Given your sample data, the code and output shown in your initial post are not consistent (even after adding the missing semicolon in the second ELSE IF statement and correcting the case mismatch "Change" vs. "change"): Contrary to your statement, the last observation would receive FLAG='Increased'. The first IF condition is processed for every observation. Hence, for the last observation it evaluates to 'B'='B' and 30>20, which is true.
- As @data_null__ has pointed out already, your code doesn't work correctly, because
- some of the LAG function calls are conditional (ELSE branches are not executed if a previous IF condition was met!). This means, depending on the input data, it can happen that they are not executed for a particular observation. AND
- The intention is (apparently) that lag(variable) should give the value of variable of the previous observation. But this is not the case if the function was not executed for the previous observation.
With your sample data, however, this error is not visible in the output. In contrast, with data_null__'s sample data it would become apparent, as shown below.
(EDIT: The sort order of this test data set might not be what you want in practice, but it doesn't interfere with my explanations of the LAG function.)
The third observation of ID C is incorrectly flagged "No Change," because for the previous observation the first IF condition was met. As a consequence, the queue behind the second lag(score) was not touched. So, when the third obs. of ID C is processed, it still contains the value from the first obs. of ID C. The condition 'C'='C' and 20=20 (comparing values from the first and third observation of ID C) yields the incorrect flag.
I think, two very important points for understanding the LAG function are that
- for each place where the LAG function occurs in the code a separate queue is created
- if and only if the function call is executed, the LAG function returns the value from the queue and writes the current value of the variable in its argument to the queue.
Please note that the DIF function is closely related to the LAG function and therefore requires similar attention.
As mohamed_zaki has mentioned, variable DATE has no impact on variable FLAG as it is not involved in its definition.
[Edit 2019-04-30: Reattached screenshot, which had been deleted by mistake.]