Hi All,
here is the dataset:
data have;
input Id$ year income;
cards;
1 2000 33
1 2001 71
1 2002 76
1 2003 56
1 2004 56
2 2000 39
2 2001 74
2 2002 36
2 2003 48
2 2004 46
3 2000 34
3 2001 44
3 2002 69
3 2003 49
3 2004 31
4 2000 48
4 2001 38
4 2002 79
4 2003 48
4 2004 70
;
run;
and here is the target dataset:
id year income want_lag1_income want_lag2_income
1 2000 33 . .
1 2001 71 33 .
1 2002 76 71 33
1 2003 56 76 71
1 2004 56 56 76
2 2000 39 . .
2 2001 74 39 .
2 2002 36 74 39
2 2003 48 36 74
2 2004 46 48 36
3 2000 34 . .
3 2001 44 34 .
3 2002 69 44 34
3 2003 49 69 44
3 2004 31 49 69
4 2000 48 . .
4 2001 38 48 .
4 2002 79 38 48
4 2003 48 79 38
4 2004 70 48 79
Could you please help how to use
lag function to create a one period lag and two period lag
by a setting group, here I mean by id and year
I would like the lag 1 has 1 missing value for each group and lag 2 has 2 missing value.
Thank you so much for the help.
You can take advantage of the BY statement and the IFN function as below:
data want;
set have;
by id;
income_lag1=ifn(first.id,.,lag1(income));
income_lag2=ifn(first.id or lag(first.id),.,lag2(income));
run;
Now if you have a large number of lags to consider, the IFN functions take on progressively more extensive first arguments. You can get around that by this structure, in which an n-period lag takes its values from an n-1 period lag:
data want;
set have;
by id;
income_lag1=ifn(first.id,.,lag1(income));
income_lag2=ifn(first.id,.,lag1(income_lag1));
income_lag3=ifn(first.id,.,lag1(income_lag2));
run;
Or you can avoid that sequential dependency by dropping the use of the BY ID statement, and comparing ID to n-period lagged values of ID:
data want;
set have;
income_lag1=ifn(id^=lag1(id),.,lag1(income));
income_lag2=ifn(id^=lag2(id),.,lag2(income));
income_lag3=ifn(id^=lag3(id),.,lag3(income));
run;
Try it out. That is the best way to learn.
The lag part is easy, just call the function. The complicated part of your request, and not very, is the conditional requirement to set some values missing.
One way:
The BY statement means that SAS will set automatic variables First.<by variable> and Last.<by variable> which indicate whether the current observation is the first or last of a by group. These automatic variables are not added to the data set, are numeric 1/0 for true/false. These can be used the aid any of the conditional code to execute when needed. In this case we use it reset a counter variable that is Retained so the value is kept across iterations of the data step. So we know if the current observation is the first, second (or third or ...). Then based on that counter we can conditionally set values to missing. I use the Call Missing function because that function will work with multiple variables and mixed numeric or character variables if needed to set the values to the missing condition.
Then I drop the counter variable so it does not appear in your data set.
Note: The BY variable(s) should be in sort order. If not but do appear as a group you can add the NOTSORTED option to the BY statement.
data want; set have; by id; retain idcounter; if first.id then idcounter=1; else idcounter+1; lagincome1 = lag1(income); lagincome2 = lag2(income); if idcounter = 1 then call missing(lagincome1,lagincome2); if idcounter = 2 then call missing(lagincome2); drop idcounter; run;
@Zhongda wrote:
Hi All,
here is the dataset:
data have;
input Id$ year income;
cards;
1 2000 33
1 2001 71
1 2002 76
1 2003 56
1 2004 56
2 2000 39
2 2001 74
2 2002 36
2 2003 48
2 2004 46
3 2000 34
3 2001 44
3 2002 69
3 2003 49
3 2004 31
4 2000 48
4 2001 38
4 2002 79
4 2003 48
4 2004 70
;
run;
and here is the target dataset:
id year income want_lag1_income want_lag2_income
1 2000 33 . .
1 2001 71 33 .
1 2002 76 71 33
1 2003 56 76 71
1 2004 56 56 76
2 2000 39 . .
2 2001 74 39 .
2 2002 36 74 39
2 2003 48 36 74
2 2004 46 48 36
3 2000 34 . .
3 2001 44 34 .
3 2002 69 44 34
3 2003 49 69 44
3 2004 31 49 69
4 2000 48 . .
4 2001 38 48 .
4 2002 79 38 48
4 2003 48 79 38
4 2004 70 48 79
Could you please help how to use
lag function to create a one period lag and two period lag
by a setting group, here I mean by id and year
I would like the lag 1 has 1 missing value for each group and lag 2 has 2 missing value.
Thank you so much for the help.
You can also try this:
data want;
set have;
by id;
if first.id then call missing(want_lag1_income,want_lag2_income);
output;
want_lag2_income=want_lag1_income;
want_lag1_income=income;
retain want_lag1_income want_lag2_income;
run;
data new(drop=i count);
set have;
by id;
array x(*) want_lag1_income want_lag2_income;
want_lag1_income=lag1(income);
want_lag2_income=lag2(income);
if first.id then count=1;
do i=count to dim(x);
x(i)=.;
end;
count + 1;
run;
You can take advantage of the BY statement and the IFN function as below:
data want;
set have;
by id;
income_lag1=ifn(first.id,.,lag1(income));
income_lag2=ifn(first.id or lag(first.id),.,lag2(income));
run;
Now if you have a large number of lags to consider, the IFN functions take on progressively more extensive first arguments. You can get around that by this structure, in which an n-period lag takes its values from an n-1 period lag:
data want;
set have;
by id;
income_lag1=ifn(first.id,.,lag1(income));
income_lag2=ifn(first.id,.,lag1(income_lag1));
income_lag3=ifn(first.id,.,lag1(income_lag2));
run;
Or you can avoid that sequential dependency by dropping the use of the BY ID statement, and comparing ID to n-period lagged values of ID:
data want;
set have;
income_lag1=ifn(id^=lag1(id),.,lag1(income));
income_lag2=ifn(id^=lag2(id),.,lag2(income));
income_lag3=ifn(id^=lag3(id),.,lag3(income));
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.