Solved: Re: When to use If or If-Then in Data step

bruinsteph · Posted 05-14-2020 10:48 PM

I am wondering why these two statements give different results. The difference is where I set the If condition:

data dt;
set dt;
lyear = lag(year);
lfirm = lag(firm);
if lfirm=firm;
run;

data dt;
set dt;
lyear = lag(year);
if lfirm=firm then lfirm = lag(firm);
run;

s_lassen · Posted 05-15-2020 05:52 AM

The IF statement without THEN (IF <condition>) is equivalent to IF NOT <condition> THEN DELETE;

So, your first data step could also be written as

data dt;
  set dt;
  lyear = lag(year);
  lfirm = lag(firm);
  if lfirm ne firm then delete;
run;

The second data step is (as others have also remarked) using a very dangerous approach: calling a LAG function conditionally.

Here's what the LAG function does:

It has an internal queue. Not a stack as it was called by @ChrisNZ , values in a queue are FIFO (first in, first out), while values in a stack are LIFO (last in, first out). This makes no difference when using the basic LAG function (the queue has a length of one value only), but when using e.g. LAG2 or LAG3, the queue is longer. Sorry for being a bit pedantic...

Stuff only gets pushed into the queue when the LAG function is called.

So what happens in your second data step? The statement

if lfirm=firm then lfirm = lag(firm);

does not do anything meaningful in this context. LFIRM has not been set to a value, so it is missing to begin with. The LAG function will only be called when FIRM is also missing, meaning that all the LAG function does is pull the old missing value from the queue, and push a new missing value (the current value of FIRM) into the queue. This means that LFIRM will always be missing.

View solution in original post

SASKiwi · Posted 05-14-2020 11:45 PM

Never use the LAG function inside a condition - you will get unpredictable results.

ChrisNZ · Posted 05-15-2020 12:02 AM

The lag function manages its stack when it's called.

If you don't call it for every observation (for example you only call it when a condition is true), you don't get every value.

This is might a desirable outcome, but most often it is not.

High-Performance SAS Coding - Third Edition

s_lassen · Posted 05-15-2020 05:52 AM

The IF statement without THEN (IF <condition>) is equivalent to IF NOT <condition> THEN DELETE;

So, your first data step could also be written as

data dt;
  set dt;
  lyear = lag(year);
  lfirm = lag(firm);
  if lfirm ne firm then delete;
run;

The second data step is (as others have also remarked) using a very dangerous approach: calling a LAG function conditionally.

Here's what the LAG function does:

It has an internal queue. Not a stack as it was called by @ChrisNZ , values in a queue are FIFO (first in, first out), while values in a stack are LIFO (last in, first out). This makes no difference when using the basic LAG function (the queue has a length of one value only), but when using e.g. LAG2 or LAG3, the queue is longer. Sorry for being a bit pedantic...

Stuff only gets pushed into the queue when the LAG function is called.

So what happens in your second data step? The statement

if lfirm=firm then lfirm = lag(firm);

does not do anything meaningful in this context. LFIRM has not been set to a value, so it is missing to begin with. The LAG function will only be called when FIRM is also missing, meaning that all the LAG function does is pull the old missing value from the queue, and push a new missing value (the current value of FIRM) into the queue. This means that LFIRM will always be missing.

ChrisNZ · Posted 05-15-2020 06:04 AM

Queue vs stack. Noted. 🙂

High-Performance SAS Coding - Third Edition

When to use If or If-Then in Data step

Re: When to use If or If-Then in Data step

Re: When to use If or If-Then in Data step

Re: When to use If or If-Then in Data step

Re: When to use If or If-Then in Data step

Re: When to use If or If-Then in Data step

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away