Hello,
I am trying to assign values of variables from previous observations to new variables in current observations within unique ID groupings using the LAG function. To prevent assigning the last value of previous unique IDs to the first observation for subsequent IDs, I used a SET statement with BY group processing to enable use of FIRST.ID logic to leave values for the first observation of each unique ID missing. However, I am getting unexpected results.
For the first unique ID, I am getting missing values returned by the LAG function for the first two observations rather than just the first. For subsequent ID numbers, I am getting the lagged value of the last observation of the previous ID assigned to the second observation of the current ID. Below is a sample code I created.
data test;
input ID Var1 $ Var2;
datalines;
1 A 0
1 A 1
1 A 1
1 A 0
1 A 0
2 B 0
2 B 0
2 B 0
3 C .
3 C 1
3 C 1
3 C 0
;
run;
data test_mod;
set test;
by ID;
*Want values for first obs missing;
*Want values of 2nd obs to have value of 1st obs within each ID group;
if not first.ID then do;
Prev_Var1 = lag(Var1);
Prev_Var2 = lag(Var2);
end;
run;
The dataset Test_Mod looks like this.
ID Var1 Var2 Prev_Var1 Prev_Var2
1 A 0 .
1 A 1 .
1 A 1 A 1
1 A 0 A 1
1 A 0 A 0
2 B 0 .
2 B 0 A 0
2 B 0 B 0
3 C . .
3 C 1 B 0
3 C 1 C 1
3 C 0 C 1
Very common issue when using the LAG() function.
Do not use it in a conditional branch; it only feeds values to the internal queue when it is called, so it won't pick up a new value in the first observation of a group.
Restructure your code:
data test_mod;
set test;
by ID;
Prev_Var1 = lag(Var1);
Prev_Var2 = lag(Var2);
if first.ID then do;
Prev_Var1 = .;
Prev_Var2 = .;
end;
run;
Very common issue when using the LAG() function.
Do not use it in a conditional branch; it only feeds values to the internal queue when it is called, so it won't pick up a new value in the first observation of a group.
Restructure your code:
data test_mod;
set test;
by ID;
Prev_Var1 = lag(Var1);
Prev_Var2 = lag(Var2);
if first.ID then do;
Prev_Var1 = .;
Prev_Var2 = .;
end;
run;
Alternatively, you can do this. This works because the Ifn/Ifc functions execute the Lag Function in the third argument regardless of whether the condition is true or not. This way, the queue/dequeue mechanism behind the scenes does not skip any values.
data test;
input ID Var1 $ Var2;
datalines;
1 A 0
1 A 1
1 A 1
1 A 0
1 A 0
2 B 0
2 B 0
2 B 0
3 C .
3 C 1
3 C 1
3 C 0
;
run;
data test_mod;
set test;
by ID;
Prev_Var1 = ifc(first.id, ' ', lag(Var1));
Prev_Var2 = ifn(first.id, . , lag(Var2));
run;
Hi @PeterClemmensen ,
Thanks for the suggestion. I was not familiar with those functions. So it is good to learn about them!
These functions are a boon for situations like yours, as they make sure that both branches are executed, while the result of only one is kept. Being a very traditional programmer myself, I always forget to mention them 😉
@Kurt_Bremser wrote:
These functions are a boon for situations like yours, as they make sure that both branches are executed, while the result of only one is kept. Being a very traditional programmer myself, I always forget to mention them 😉
Partially the same about "traditional" but I find the IFC/IFN to look ugly, especially if nested, and hard to read quickly. Maybe just lack of practice but I used such that look like than in spreadsheets and constantly cringed while typing such formulae.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.