I am wondering why these two statements give different results. The difference is where I set the If condition:
data dt;
set dt;
lyear = lag(year);
lfirm = lag(firm);
if lfirm=firm;
run;
data dt;
set dt;
lyear = lag(year);
if lfirm=firm then lfirm = lag(firm);
run;
The IF statement without THEN (IF <condition>) is equivalent to IF NOT <condition> THEN DELETE;
So, your first data step could also be written as
data dt;
set dt;
lyear = lag(year);
lfirm = lag(firm);
if lfirm ne firm then delete;
run;
The second data step is (as others have also remarked) using a very dangerous approach: calling a LAG function conditionally.
Here's what the LAG function does:
It has an internal queue. Not a stack as it was called by @ChrisNZ , values in a queue are FIFO (first in, first out), while values in a stack are LIFO (last in, first out). This makes no difference when using the basic LAG function (the queue has a length of one value only), but when using e.g. LAG2 or LAG3, the queue is longer. Sorry for being a bit pedantic...
Stuff only gets pushed into the queue when the LAG function is called.
So what happens in your second data step? The statement
if lfirm=firm then lfirm = lag(firm);
does not do anything meaningful in this context. LFIRM has not been set to a value, so it is missing to begin with. The LAG function will only be called when FIRM is also missing, meaning that all the LAG function does is pull the old missing value from the queue, and push a new missing value (the current value of FIRM) into the queue. This means that LFIRM will always be missing.
Never use the LAG function inside a condition - you will get unpredictable results.
The lag function manages its stack when it's called.
If you don't call it for every observation (for example you only call it when a condition is true), you don't get every value.
This is might a desirable outcome, but most often it is not.
The IF statement without THEN (IF <condition>) is equivalent to IF NOT <condition> THEN DELETE;
So, your first data step could also be written as
data dt;
set dt;
lyear = lag(year);
lfirm = lag(firm);
if lfirm ne firm then delete;
run;
The second data step is (as others have also remarked) using a very dangerous approach: calling a LAG function conditionally.
Here's what the LAG function does:
It has an internal queue. Not a stack as it was called by @ChrisNZ , values in a queue are FIFO (first in, first out), while values in a stack are LIFO (last in, first out). This makes no difference when using the basic LAG function (the queue has a length of one value only), but when using e.g. LAG2 or LAG3, the queue is longer. Sorry for being a bit pedantic...
Stuff only gets pushed into the queue when the LAG function is called.
So what happens in your second data step? The statement
if lfirm=firm then lfirm = lag(firm);
does not do anything meaningful in this context. LFIRM has not been set to a value, so it is missing to begin with. The LAG function will only be called when FIRM is also missing, meaning that all the LAG function does is pull the old missing value from the queue, and push a new missing value (the current value of FIRM) into the queue. This means that LFIRM will always be missing.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.