I have a dataset, an excerpt of it is given below:
DATA have;
id year output
;
DATALINES;
1001 1987 .
1001 1988 .
1001 1989 .
1001 1990 5
1001 1991 .
1001 1992 .
1001 1993 .
1001 1994 34
1001 1995 22
1001 1996 33
1001 1997 15
1001 1998 .
1001 1999 .
1001 2000 .
1001 2001 23
1001 2002 .
1001 2003 45
1001 2004 23
1001 2005 12
1001 2006 .
1001 2007 .
1001 2008 .
1001 2009 .
1001 2010 2
1001 2011 .
1001 2012 .
1001 2013 56
1001 2014 .
1001 2015 .
1001 2016 .
1001 2017 .
1001 2018 23
1001 2019 .
1002 1987 34
1002 1988 .
1002 1989 12
1002 1990 13
1002 1991 55
1002 1992 32
1002 1993 .
1002 1994 .
1002 1995 54
1002 1996 64
1002 1997 .
1002 1998 .
1002 1999 23
1002 2000 .
1002 2001 .
1002 2002 .
1002 2003 .
1002 2004 64
1002 2005 12
1002 2006 .
1002 2007 .
1002 2008 .
1002 2009 .
1002 2010 3
1002 2011 .
1002 2012 .
1002 2013 .
1002 2014 .
1002 2015 .
1002 2016 .
1002 2017 4
1002 2018 .
1002 2019 12
;
RUN;
Each id always has year going from 1987 to 2019 (no gaps). I want to "lag" the output variable to produce the following dataset
DATA want;
id year output output_prev;
DATALINES;
1001 1987 . .
1001 1988 . .
1001 1989 . .
1001 1990 5 .
1001 1991 . 5
1001 1992 . 5
1001 1993 . 5
1001 1994 34 5
1001 1995 22 34
1001 1996 33 22
1001 1997 15 33
1001 1998 . 15
1001 1999 . 15
1001 2000 . 15
1001 2001 23 15
1001 2002 . 23
1001 2003 45 23
1001 2004 23 45
1001 2005 12 23
1001 2006 . 12
1001 2007 . 12
1001 2008 . 12
1001 2009 . 12
1001 2010 2 12
1001 2011 . 2
1001 2012 . 2
1001 2013 56 2
1001 2014 . 56
1001 2015 . 56
1001 2016 . 56
1001 2017 . 56
1001 2018 23 56
1001 2019 . 23
1002 1987 34 .
1002 1988 . 34
1002 1989 12 34
1002 1990 13 12
1002 1991 55 13
1002 1992 32 55
1002 1993 . 32
1002 1994 . 32
1002 1995 54 32
1002 1996 64 54
1002 1997 . 64
1002 1998 . 64
1002 1999 23 64
1002 2000 . 23
1002 2001 . 23
1002 2002 . 23
1002 2003 . 23
1002 2004 64 23
1002 2005 12 64
1002 2006 . 12
1002 2007 . 12
1002 2008 . 12
1002 2009 . 12
1002 2010 3 12
1002 2011 . 3
1002 2012 . 3
1002 2013 . 3
1002 2014 . 3
1002 2015 . 3
1002 2016 . 3
1002 2017 4 3
1002 2018 . 4
1002 2019 12 4
;
RUN;
It seems simple but I can't seem to do it. I want to lag the output variable so that each year for a given id always takes the "latest" value of output. As an example, for id=1001, year=1991, output_prev=5 because that was the latest value of output (last updated in year 1990). This continues until we reach year 1995, where output_prev changes to 34 because the latest value of output is 34 (last updated in year 1994).
@elbarto wrote:
I have a dataset, an excerpt of it is given below:
DATA have; id year output ; DATALINES; 1001 1987 . 1001 1988 . 1001 1989 . 1001 1990 5 1001 1991 . 1001 1992 . 1001 1993 . 1001 1994 34 1001 1995 22 1001 1996 33 1001 1997 15 1001 1998 . 1001 1999 . 1001 2000 . 1001 2001 23 1001 2002 . 1001 2003 45 1001 2004 23 1001 2005 12 1001 2006 . 1001 2007 . 1001 2008 . 1001 2009 . 1001 2010 2 1001 2011 . 1001 2012 . 1001 2013 56 1001 2014 . 1001 2015 . 1001 2016 . 1001 2017 . 1001 2018 23 1001 2019 . 1002 1987 34 1002 1988 . 1002 1989 12 1002 1990 13 1002 1991 55 1002 1992 32 1002 1993 . 1002 1994 . 1002 1995 54 1002 1996 64 1002 1997 . 1002 1998 . 1002 1999 23 1002 2000 . 1002 2001 . 1002 2002 . 1002 2003 . 1002 2004 64 1002 2005 12 1002 2006 . 1002 2007 . 1002 2008 . 1002 2009 . 1002 2010 3 1002 2011 . 1002 2012 . 1002 2013 . 1002 2014 . 1002 2015 . 1002 2016 . 1002 2017 4 1002 2018 . 1002 2019 12 ; RUN;
Each id always has year going from 1987 to 2019 (no gaps). I want to "lag" the output variable to produce the following dataset
DATA want; id year output output_prev; DATALINES; 1001 1987 . . 1001 1988 . . 1001 1989 . . 1001 1990 5 . 1001 1991 . 5 1001 1992 . 5 1001 1993 . 5 1001 1994 34 5 1001 1995 22 34 1001 1996 33 22 1001 1997 15 33 1001 1998 . 15 1001 1999 . 15 1001 2000 . 15 1001 2001 23 15 1001 2002 . 23 1001 2003 45 23 1001 2004 23 45 1001 2005 12 23 1001 2006 . 12 1001 2007 . 12 1001 2008 . 12 1001 2009 . 12 1001 2010 2 12 1001 2011 . 2 1001 2012 . 2 1001 2013 56 2 1001 2014 . 56 1001 2015 . 56 1001 2016 . 56 1001 2017 . 56 1001 2018 23 56 1001 2019 . 23 1002 1987 34 . 1002 1988 . 34 1002 1989 12 34 1002 1990 13 12 1002 1991 55 13 1002 1992 32 55 1002 1993 . 32 1002 1994 . 32 1002 1995 54 32 1002 1996 64 54 1002 1997 . 64 1002 1998 . 64 1002 1999 23 64 1002 2000 . 23 1002 2001 . 23 1002 2002 . 23 1002 2003 . 23 1002 2004 64 23 1002 2005 12 64 1002 2006 . 12 1002 2007 . 12 1002 2008 . 12 1002 2009 . 12 1002 2010 3 12 1002 2011 . 3 1002 2012 . 3 1002 2013 . 3 1002 2014 . 3 1002 2015 . 3 1002 2016 . 3 1002 2017 4 3 1002 2018 . 4 1002 2019 12 4 ; RUN;
It seems simple but I can't seem to do it. I want to lag the output variable so that each year for a given id always takes the "latest" value of output. As an example, for id=1001, year=1991, output_prev=5 because that was the latest value of output (last updated in year 1990). This continues until we reach year 1995, where output_prev changes to 34 because the latest value of output is 34 (last updated in year 1994).
Please make sure that your example data step code runs. You are missing an INPUT statement.
For this data you actually want a RETAIN to hold the value across iterations of the data step and know when to write the output record to the data. (BTW naming a variable OUTPUT makes this somewhat confusing as Output is a programming statement as well and could lead to all sorts of miscommunication).
This seems to create your "want" from the example.
DATA have; input id year output ; DATALINES; 1001 1987 . 1001 1988 . 1001 1989 . 1001 1990 5 1001 1991 . 1001 1992 . 1001 1993 . 1001 1994 34 1001 1995 22 1001 1996 33 1001 1997 15 1001 1998 . 1001 1999 . 1001 2000 . 1001 2001 23 1001 2002 . 1001 2003 45 1001 2004 23 1001 2005 12 1001 2006 . 1001 2007 . 1001 2008 . 1001 2009 . 1001 2010 2 1001 2011 . 1001 2012 . 1001 2013 56 1001 2014 . 1001 2015 . 1001 2016 . 1001 2017 . 1001 2018 23 1001 2019 . 1002 1987 34 1002 1988 . 1002 1989 12 1002 1990 13 1002 1991 55 1002 1992 32 1002 1993 . 1002 1994 . 1002 1995 54 1002 1996 64 1002 1997 . 1002 1998 . 1002 1999 23 1002 2000 . 1002 2001 . 1002 2002 . 1002 2003 . 1002 2004 64 1002 2005 12 1002 2006 . 1002 2007 . 1002 2008 . 1002 2009 . 1002 2010 3 1002 2011 . 1002 2012 . 1002 2013 . 1002 2014 . 1002 2015 . 1002 2016 . 1002 2017 4 1002 2018 . 1002 2019 12 ; /* assumes sorted by ID, if not*/ proc sort data=have; by id year; run; data want; set have; retain out_prev; by id; if first.id then call missing(out_prev); output; if not missing(output) then out_prev=output; run;
Retain keeps values of variables across the data step boundary.
Placing the OUTPUT statement to write to the data before assigning the "new" value for output (see what I meant about confusion) means the the record is written with the last value of output_prev.
@elbarto wrote:
I have a dataset, an excerpt of it is given below:
DATA have; id year output ; DATALINES; 1001 1987 . 1001 1988 . 1001 1989 . 1001 1990 5 1001 1991 . 1001 1992 . 1001 1993 . 1001 1994 34 1001 1995 22 1001 1996 33 1001 1997 15 1001 1998 . 1001 1999 . 1001 2000 . 1001 2001 23 1001 2002 . 1001 2003 45 1001 2004 23 1001 2005 12 1001 2006 . 1001 2007 . 1001 2008 . 1001 2009 . 1001 2010 2 1001 2011 . 1001 2012 . 1001 2013 56 1001 2014 . 1001 2015 . 1001 2016 . 1001 2017 . 1001 2018 23 1001 2019 . 1002 1987 34 1002 1988 . 1002 1989 12 1002 1990 13 1002 1991 55 1002 1992 32 1002 1993 . 1002 1994 . 1002 1995 54 1002 1996 64 1002 1997 . 1002 1998 . 1002 1999 23 1002 2000 . 1002 2001 . 1002 2002 . 1002 2003 . 1002 2004 64 1002 2005 12 1002 2006 . 1002 2007 . 1002 2008 . 1002 2009 . 1002 2010 3 1002 2011 . 1002 2012 . 1002 2013 . 1002 2014 . 1002 2015 . 1002 2016 . 1002 2017 4 1002 2018 . 1002 2019 12 ; RUN;
Each id always has year going from 1987 to 2019 (no gaps). I want to "lag" the output variable to produce the following dataset
DATA want; id year output output_prev; DATALINES; 1001 1987 . . 1001 1988 . . 1001 1989 . . 1001 1990 5 . 1001 1991 . 5 1001 1992 . 5 1001 1993 . 5 1001 1994 34 5 1001 1995 22 34 1001 1996 33 22 1001 1997 15 33 1001 1998 . 15 1001 1999 . 15 1001 2000 . 15 1001 2001 23 15 1001 2002 . 23 1001 2003 45 23 1001 2004 23 45 1001 2005 12 23 1001 2006 . 12 1001 2007 . 12 1001 2008 . 12 1001 2009 . 12 1001 2010 2 12 1001 2011 . 2 1001 2012 . 2 1001 2013 56 2 1001 2014 . 56 1001 2015 . 56 1001 2016 . 56 1001 2017 . 56 1001 2018 23 56 1001 2019 . 23 1002 1987 34 . 1002 1988 . 34 1002 1989 12 34 1002 1990 13 12 1002 1991 55 13 1002 1992 32 55 1002 1993 . 32 1002 1994 . 32 1002 1995 54 32 1002 1996 64 54 1002 1997 . 64 1002 1998 . 64 1002 1999 23 64 1002 2000 . 23 1002 2001 . 23 1002 2002 . 23 1002 2003 . 23 1002 2004 64 23 1002 2005 12 64 1002 2006 . 12 1002 2007 . 12 1002 2008 . 12 1002 2009 . 12 1002 2010 3 12 1002 2011 . 3 1002 2012 . 3 1002 2013 . 3 1002 2014 . 3 1002 2015 . 3 1002 2016 . 3 1002 2017 4 3 1002 2018 . 4 1002 2019 12 4 ; RUN;
It seems simple but I can't seem to do it. I want to lag the output variable so that each year for a given id always takes the "latest" value of output. As an example, for id=1001, year=1991, output_prev=5 because that was the latest value of output (last updated in year 1990). This continues until we reach year 1995, where output_prev changes to 34 because the latest value of output is 34 (last updated in year 1994).
Please make sure that your example data step code runs. You are missing an INPUT statement.
For this data you actually want a RETAIN to hold the value across iterations of the data step and know when to write the output record to the data. (BTW naming a variable OUTPUT makes this somewhat confusing as Output is a programming statement as well and could lead to all sorts of miscommunication).
This seems to create your "want" from the example.
DATA have; input id year output ; DATALINES; 1001 1987 . 1001 1988 . 1001 1989 . 1001 1990 5 1001 1991 . 1001 1992 . 1001 1993 . 1001 1994 34 1001 1995 22 1001 1996 33 1001 1997 15 1001 1998 . 1001 1999 . 1001 2000 . 1001 2001 23 1001 2002 . 1001 2003 45 1001 2004 23 1001 2005 12 1001 2006 . 1001 2007 . 1001 2008 . 1001 2009 . 1001 2010 2 1001 2011 . 1001 2012 . 1001 2013 56 1001 2014 . 1001 2015 . 1001 2016 . 1001 2017 . 1001 2018 23 1001 2019 . 1002 1987 34 1002 1988 . 1002 1989 12 1002 1990 13 1002 1991 55 1002 1992 32 1002 1993 . 1002 1994 . 1002 1995 54 1002 1996 64 1002 1997 . 1002 1998 . 1002 1999 23 1002 2000 . 1002 2001 . 1002 2002 . 1002 2003 . 1002 2004 64 1002 2005 12 1002 2006 . 1002 2007 . 1002 2008 . 1002 2009 . 1002 2010 3 1002 2011 . 1002 2012 . 1002 2013 . 1002 2014 . 1002 2015 . 1002 2016 . 1002 2017 4 1002 2018 . 1002 2019 12 ; /* assumes sorted by ID, if not*/ proc sort data=have; by id year; run; data want; set have; retain out_prev; by id; if first.id then call missing(out_prev); output; if not missing(output) then out_prev=output; run;
Retain keeps values of variables across the data step boundary.
Placing the OUTPUT statement to write to the data before assigning the "new" value for output (see what I meant about confusion) means the the record is written with the last value of output_prev.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.