BookmarkSubscribeRSS Feed
Callam1
Obsidian | Level 7

Hi

Copilot claims that the lag function is not safe to use in the code below.  I am trying to calculate a variable 'pay' as the difference between cumulative pay in two consecutive rows, unless certain conditions are present. Copilot says:  'lag() maintains a global queue and does not reset at BY‑group boundaries. That means your first record in a new personID (or employerID or tax year), can subtract a value from a different ID/employerID/tax year. Fix: use BY groups with first. flags.'.......should I trust this advice?

 

proc sort data=sorted_data;
by personID employerid tax_year_start payment_date;
run;

/*Pay is the difference in cumulative pay between 2 consecutive payment dates, unless it is a new person, new employer or new tax yar, in which
case pay should be equal to start_pay;
data pay_data;
set sorted_data;
pay = cum_pay - lag(cum_pay);

if personid~= lag(personid)
or employerid~= lag(employerid)
or tax_year_start~=lag(tax_year_start)
then pay=start_pay;

run;

 

3 REPLIES 3
Patrick
Opal | Level 21

What Copilot tells you is true for:

pay=cum_pay - lag(cum_pay);

But in your logic you then populate variable pay again with start_pay whenever the condition becomes true - and the condition will always be true if there is a different ID/employerID/tax year and though the issue Copilot flags won't come into effect.

if personid~= lag(personid)
   or employerid~= lag(employerid) 
   or tax_year_start~=lag(tax_year_start)
then pay=start_pay; 

 

You could consider to express your logic as below.

data pay_data;
  set sorted_data;
  by personID employerid tax_year_start payment_date;
  pay=cum_pay - lag(cum_pay);
  if first.tax_year_start then pay=start_pay;
run;

Note: lag(cum_pay) needs to execute for every row and though can't get moved to an else condition.

 

Tom
Super User Tom
Super User

Yes.  But it looks like you already did that, you just wrote the logic in a confusing way.

 

Instead do as Copilot says and add a BY statement to the DATA step and use the FIRST. flag that SAS will set automatically instead of trying to calculate if it is the first observation in the group on your own. 

 

Note that in addition to setting the various FIRST. and LAST. flags the BY statement will also insure that the data step fails if for some reason SORTED_DATA is not actually sorted in the way the BY statement says.  Your step would instead happily generate gibberish values for PAY in that situation.

 

Also note that you should use the DIF() function instead of the LAG() function.  That will not only save some typing but also make the meaning of the code clearer and eliminate the nasty note about missing values.

 71         data test;
 72           set sashelp.class;
 73           age1=age-lag(age);
 74           age2=dif(age);
 75         run;
 
 NOTE: Missing values were generated as a result of performing an operation on missing values.
       Each place is given by: (Number of times) at (Line):(Column).
       1 at 73:11   
 NOTE: There were 19 observations read from the data set SASHELP.CLASS.
 NOTE: The data set WORK.TEST has 19 observations and 7 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       cpu time            0.00 seconds
       

So your code should look like this:

proc sort data=sorted_data;
  by personID employerid tax_year_start payment_date;
run;

data pay_data;
  set sorted_data;
  by personID employerid tax_year_start payment_date;
  pay = dif(cum_pay);
  if first.tax_year_start then pay=start_pay;
run;

 

Also note that I just assumed you know what your data is and that the variable START_PAY has the value you want for PAY on the first observation per PERSIONID, per EMPLOYERID, per TAX_YEAR_START.   If you don't actually have such a START_PAY variable then look and the data and see if CUM_PAY might already be resetting for each new TAX_YEAR_START so that it can be used to set PAY.

Ksharp
Super User

Your LAG() is safe.
a.k.a LAG() is not executed under IF condition.
About the LAG() in IF statement, LAG() would executed every time(or data loop) , therefore you do not need to worry about it .

If you know the exactly logic behind LAG(), you would also get the right code or result.
For example:
If you want last obs subtract the first obs in each group when each group has at least 2 obs, you could code like:

 

data x;
input a b;
cards;
1 1
1 2
1 3
1 4
2 12
2 13
2 14
;

data y;
set x;
by a;
if first.a or last.a then first=lag(b);
if last.a then diff=b-first;
run;

Ksharp_0-1765682506615.png

 

sas-innovate-2026-white.png



April 27 – 30 | Gaylord Texan | Grapevine, Texas

Registration is open

Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 146 views
  • 2 likes
  • 4 in conversation