- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi! I'm a first time poster, but 10 minutes of online searching didn't yield a similar result, so I apologize if I just failed at finding a similar, past post w/ a solution.
I have a dataset called 'untaggedData' that is comprised of two columns: 'id' and 'year'. I've sorted it by id first and then by year (descending). Most id's have multiple years.
My goal is to label the oldest year (bottom row) for any given id where the label tells me whether or not a >= 2 year difference occurs in any consecutive years for that given id.
While I have years of coding experience in other languages like Python and VBA, I'm new to SAS and its many unfamiliar mannerisms. The following code seemed like a line-efficient way to accomplish my goal, but I'm discovering that mixing the lag() function with IF-THEN statements is not intuitive.
data taggedData;
set work.untaggedData;
category = "A";
if lag(id) = id and lag(year) - year >= 2 then category = "B";
if lag(id) = id and lag(category) = "B" then category = "B";
run;
Here's a suitable mock-up of the result my inadequate code:
ID | YEAR | CATEGORY |
1 | 2022 | A |
1 | 2021 | A |
1 | 2006 | B |
1 | 2006 | B |
1 | 2005 | A |
1 | 2003 | B |
1 | 2003 | B |
1 | 2003 | A |
1 | 2002 | A |
2 | 2022 | A |
2 | 2021 | A |
... | ... | ... |
So I don't understand why SAS is not labeling all subsequent rows (per ID) with "B" after it occurs the first time?
Though my primary question is how to achieve my goal? How would you go about labeling the oldest year (bottom row) for any given id so that it says whether or not a >= 2 year difference occurs in any consecutive years for that given id?
Thanks,
Zach
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The LAG() function sets up a queue, where an element is taken from the "top" and a new one inserted at the "bottom" whenever the function is called. So, calling the function conditionally is usually a bad idea, with confusing results.
But you call it unconditionally, which is good.
But at the time you call it for lag(category), category will be "A" if the first IF was not true, and that "A" goes into the queue.
Use RETAIN and BY instead.
data taggedData;
set work.untaggedData;
by id;
retain category;
if first.id then category = "A";
if not first.id and lag(year) - year >= 2 then category = "B";
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The LAG() function sets up a queue, where an element is taken from the "top" and a new one inserted at the "bottom" whenever the function is called. So, calling the function conditionally is usually a bad idea, with confusing results.
But you call it unconditionally, which is good.
But at the time you call it for lag(category), category will be "A" if the first IF was not true, and that "A" goes into the queue.
Use RETAIN and BY instead.
data taggedData;
set work.untaggedData;
by id;
retain category;
if first.id then category = "A";
if not first.id and lag(year) - year >= 2 then category = "B";
run;