Solved: Re: lag one variable in each setting group, using lag function.

Zhongda · Posted 09-30-2023 07:54 PM

Hi All,

here is the dataset:

data have;
input Id$ year income;
cards;
1 2000 33
1 2001 71
1 2002 76
1 2003 56
1 2004 56
2 2000 39
2 2001 74
2 2002 36
2 2003 48
2 2004 46
3 2000 34
3 2001 44
3 2002 69
3 2003 49
3 2004 31
4 2000 48
4 2001 38
4 2002 79
4 2003 48
4 2004 70
;
run;

and here is the target dataset:

id year income want_lag1_income want_lag2_income
1 2000 33 . .
1 2001 71 33 .
1 2002 76 71 33
1 2003 56 76 71
1 2004 56 56 76
2 2000 39 . .
2 2001 74 39 .
2 2002 36 74 39
2 2003 48 36 74
2 2004 46 48 36
3 2000 34 . .
3 2001 44 34 .
3 2002 69 44 34
3 2003 49 69 44
3 2004 31 49 69
4 2000 48 . .
4 2001 38 48 .
4 2002 79 38 48
4 2003 48 79 38
4 2004 70 48 79

Could you please help how to use

lag function to create a one period lag and two period lag

by a setting group, here I mean by id and year

I would like the lag 1 has 1 missing value for each group and lag 2 has 2 missing value.

Thank you so much for the help.

mkeintz · Posted 10-02-2023 01:52 PM

You can take advantage of the BY statement and the IFN function as below:

data want;
  set have;
  by id;
  income_lag1=ifn(first.id,.,lag1(income));
  income_lag2=ifn(first.id or lag(first.id),.,lag2(income));
run;

Now if you have a large number of lags to consider, the IFN functions take on progressively more extensive first arguments. You can get around that by this structure, in which an n-period lag takes its values from an n-1 period lag:

data want;
  set have;
  by id;
  income_lag1=ifn(first.id,.,lag1(income));
  income_lag2=ifn(first.id,.,lag1(income_lag1));
  income_lag3=ifn(first.id,.,lag1(income_lag2));
run;

Or you can avoid that sequential dependency by dropping the use of the BY ID statement, and comparing ID to n-period lagged values of ID:

data want;
  set have;
  income_lag1=ifn(id^=lag1(id),.,lag1(income));
  income_lag2=ifn(id^=lag2(id),.,lag2(income));
  income_lag3=ifn(id^=lag3(id),.,lag3(income));
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

WarrenKuhfeld · Posted 09-30-2023 10:00 PM

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lefunctionsref/n0l66p5oqex1f2n1quuopdvtcjqb.h...

Try it out. That is the best way to learn.

Zhongda · Posted 10-04-2023 09:44 AM

Warren, Thank you so much for pointing out this, this helps me alot.

ballardw · Posted 10-01-2023 12:07 AM

The lag part is easy, just call the function. The complicated part of your request, and not very, is the conditional requirement to set some values missing.

One way:

The BY statement means that SAS will set automatic variables First.<by variable> and Last.<by variable> which indicate whether the current observation is the first or last of a by group. These automatic variables are not added to the data set, are numeric 1/0 for true/false. These can be used the aid any of the conditional code to execute when needed. In this case we use it reset a counter variable that is Retained so the value is kept across iterations of the data step. So we know if the current observation is the first, second (or third or ...). Then based on that counter we can conditionally set values to missing. I use the Call Missing function because that function will work with multiple variables and mixed numeric or character variables if needed to set the values to the missing condition.

Then I drop the counter variable so it does not appear in your data set.

Note: The BY variable(s) should be in sort order. If not but do appear as a group you can add the NOTSORTED option to the BY statement.

data want;
   set have;
   by id;
   retain idcounter;
   if first.id then idcounter=1;
   else idcounter+1;
   lagincome1 = lag1(income);
   lagincome2 = lag2(income);
   if idcounter = 1 then call missing(lagincome1,lagincome2);
   if idcounter = 2 then call missing(lagincome2);
   drop idcounter;
run;

@Zhongda wrote:

Hi All,

here is the dataset:

data have;
input Id$ year income;
cards;
1 2000 33
1 2001 71
1 2002 76
1 2003 56
1 2004 56
2 2000 39
2 2001 74
2 2002 36
2 2003 48
2 2004 46
3 2000 34
3 2001 44
3 2002 69
3 2003 49
3 2004 31
4 2000 48
4 2001 38
4 2002 79
4 2003 48
4 2004 70
;
run;

and here is the target dataset:

id year income want_lag1_income want_lag2_income
1 2000 33 . .
1 2001 71 33 .
1 2002 76 71 33
1 2003 56 76 71
1 2004 56 56 76
2 2000 39 . .
2 2001 74 39 .
2 2002 36 74 39
2 2003 48 36 74
2 2004 46 48 36
3 2000 34 . .
3 2001 44 34 .
3 2002 69 44 34
3 2003 49 69 44
3 2004 31 49 69
4 2000 48 . .
4 2001 38 48 .
4 2002 79 38 48
4 2003 48 79 38
4 2004 70 48 79

Could you please help how to use

lag function to create a one period lag and two period lag

by a setting group, here I mean by id and year

I would like the lag 1 has 1 missing value for each group and lag 2 has 2 missing value.

Thank you so much for the help.

s_lassen · Posted 10-01-2023 06:53 AM

You can also try this:

data want;
  set have;
  by id;
  if first.id then call missing(want_lag1_income,want_lag2_income);
  output;
  want_lag2_income=want_lag1_income;
  want_lag1_income=income;
  retain want_lag1_income want_lag2_income;
run;

Report Inappropriate Content · Posted 10-01-2023 11:43 AM

data new(drop=i count);
set have;
by id;
array x(*) want_lag1_income want_lag2_income;
want_lag1_income=lag1(income);
want_lag2_income=lag2(income);
if first.id then count=1;
do i=count to dim(x);
x(i)=.;
end;
count + 1;
run;

mkeintz · Posted 10-02-2023 01:52 PM

You can take advantage of the BY statement and the IFN function as below:

data want;
  set have;
  by id;
  income_lag1=ifn(first.id,.,lag1(income));
  income_lag2=ifn(first.id or lag(first.id),.,lag2(income));
run;

Now if you have a large number of lags to consider, the IFN functions take on progressively more extensive first arguments. You can get around that by this structure, in which an n-period lag takes its values from an n-1 period lag:

data want;
  set have;
  by id;
  income_lag1=ifn(first.id,.,lag1(income));
  income_lag2=ifn(first.id,.,lag1(income_lag1));
  income_lag3=ifn(first.id,.,lag1(income_lag2));
run;

Or you can avoid that sequential dependency by dropping the use of the BY ID statement, and comparing ID to n-period lagged values of ID:

data want;
  set have;
  income_lag1=ifn(id^=lag1(id),.,lag1(income));
  income_lag2=ifn(id^=lag2(id),.,lag2(income));
  income_lag3=ifn(id^=lag3(id),.,lag3(income));
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Classroom Training Available!