BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
elbarto
Obsidian | Level 7

I have a dataset, an excerpt of it is given below:

 

DATA have;
id	year	output
;
DATALINES;
1001	1987	.
1001	1988	.
1001	1989	.
1001	1990	5
1001	1991	.
1001	1992	.
1001	1993	.
1001	1994	34
1001	1995	22
1001	1996	33
1001	1997	15
1001	1998	.
1001	1999	.
1001	2000	.
1001	2001	23
1001	2002	.
1001	2003	45
1001	2004	23
1001	2005	12
1001	2006	.
1001	2007	.
1001	2008	.
1001	2009	.
1001	2010	2
1001	2011	.
1001	2012	.
1001	2013	56
1001	2014	.
1001	2015	.
1001	2016	.
1001	2017	.
1001	2018	23
1001	2019	.
1002	1987	34
1002	1988	.
1002	1989	12
1002	1990	13
1002	1991	55
1002	1992	32
1002	1993	.
1002	1994	.
1002	1995	54
1002	1996	64
1002	1997	.
1002	1998	.
1002	1999	23
1002	2000	.
1002	2001	.
1002	2002	.
1002	2003	.
1002	2004	64
1002	2005	12
1002	2006	.
1002	2007	.
1002	2008	.
1002	2009	.
1002	2010	3
1002	2011	.
1002	2012	.
1002	2013	.
1002	2014	.
1002	2015	.
1002	2016	.
1002	2017	4
1002	2018	.
1002	2019	12
;
RUN;

 

Each id always has year going from 1987 to 2019 (no gaps). I want to "lag" the output variable to produce the following dataset

 

DATA want;
id	year	output	output_prev;
DATALINES;
1001	1987	.	.
1001	1988	.	.
1001	1989	.	.
1001	1990	5	.
1001	1991	.	5
1001	1992	.	5
1001	1993	.	5
1001	1994	34	5
1001	1995	22	34
1001	1996	33	22
1001	1997	15	33
1001	1998	.	15
1001	1999	.	15
1001	2000	.	15
1001	2001	23	15
1001	2002	.	23
1001	2003	45	23
1001	2004	23	45
1001	2005	12	23
1001	2006	.	12
1001	2007	.	12
1001	2008	.	12
1001	2009	.	12
1001	2010	2	12
1001	2011	.	2
1001	2012	.	2
1001	2013	56	2
1001	2014	.	56
1001	2015	.	56
1001	2016	.	56
1001	2017	.	56
1001	2018	23	56
1001	2019	.	23
1002	1987	34	.
1002	1988	.	34
1002	1989	12	34
1002	1990	13	12
1002	1991	55	13
1002	1992	32	55
1002	1993	.	32
1002	1994	.	32
1002	1995	54	32
1002	1996	64	54
1002	1997	.	64
1002	1998	.	64
1002	1999	23	64
1002	2000	.	23
1002	2001	.	23
1002	2002	.	23
1002	2003	.	23
1002	2004	64	23
1002	2005	12	64
1002	2006	.	12
1002	2007	.	12
1002	2008	.	12
1002	2009	.	12
1002	2010	3	12
1002	2011	.	3
1002	2012	.	3
1002	2013	.	3
1002	2014	.	3
1002	2015	.	3
1002	2016	.	3
1002	2017	4	3
1002	2018	.	4
1002	2019	12	4
;
RUN;

It seems simple but I can't seem to do it. I want to lag the output variable so that each year for a given id always takes the "latest" value of output. As an example, for id=1001, year=1991, output_prev=5 because that was the latest value of output (last updated in year 1990). This continues until we reach year 1995, where output_prev changes to 34 because the latest value of output is 34 (last updated in year 1994).

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@elbarto wrote:

I have a dataset, an excerpt of it is given below:

 

DATA have;
id	year	output
;
DATALINES;
1001	1987	.
1001	1988	.
1001	1989	.
1001	1990	5
1001	1991	.
1001	1992	.
1001	1993	.
1001	1994	34
1001	1995	22
1001	1996	33
1001	1997	15
1001	1998	.
1001	1999	.
1001	2000	.
1001	2001	23
1001	2002	.
1001	2003	45
1001	2004	23
1001	2005	12
1001	2006	.
1001	2007	.
1001	2008	.
1001	2009	.
1001	2010	2
1001	2011	.
1001	2012	.
1001	2013	56
1001	2014	.
1001	2015	.
1001	2016	.
1001	2017	.
1001	2018	23
1001	2019	.
1002	1987	34
1002	1988	.
1002	1989	12
1002	1990	13
1002	1991	55
1002	1992	32
1002	1993	.
1002	1994	.
1002	1995	54
1002	1996	64
1002	1997	.
1002	1998	.
1002	1999	23
1002	2000	.
1002	2001	.
1002	2002	.
1002	2003	.
1002	2004	64
1002	2005	12
1002	2006	.
1002	2007	.
1002	2008	.
1002	2009	.
1002	2010	3
1002	2011	.
1002	2012	.
1002	2013	.
1002	2014	.
1002	2015	.
1002	2016	.
1002	2017	4
1002	2018	.
1002	2019	12
;
RUN;

 

Each id always has year going from 1987 to 2019 (no gaps). I want to "lag" the output variable to produce the following dataset

 

DATA want;
id	year	output	output_prev;
DATALINES;
1001	1987	.	.
1001	1988	.	.
1001	1989	.	.
1001	1990	5	.
1001	1991	.	5
1001	1992	.	5
1001	1993	.	5
1001	1994	34	5
1001	1995	22	34
1001	1996	33	22
1001	1997	15	33
1001	1998	.	15
1001	1999	.	15
1001	2000	.	15
1001	2001	23	15
1001	2002	.	23
1001	2003	45	23
1001	2004	23	45
1001	2005	12	23
1001	2006	.	12
1001	2007	.	12
1001	2008	.	12
1001	2009	.	12
1001	2010	2	12
1001	2011	.	2
1001	2012	.	2
1001	2013	56	2
1001	2014	.	56
1001	2015	.	56
1001	2016	.	56
1001	2017	.	56
1001	2018	23	56
1001	2019	.	23
1002	1987	34	.
1002	1988	.	34
1002	1989	12	34
1002	1990	13	12
1002	1991	55	13
1002	1992	32	55
1002	1993	.	32
1002	1994	.	32
1002	1995	54	32
1002	1996	64	54
1002	1997	.	64
1002	1998	.	64
1002	1999	23	64
1002	2000	.	23
1002	2001	.	23
1002	2002	.	23
1002	2003	.	23
1002	2004	64	23
1002	2005	12	64
1002	2006	.	12
1002	2007	.	12
1002	2008	.	12
1002	2009	.	12
1002	2010	3	12
1002	2011	.	3
1002	2012	.	3
1002	2013	.	3
1002	2014	.	3
1002	2015	.	3
1002	2016	.	3
1002	2017	4	3
1002	2018	.	4
1002	2019	12	4
;
RUN;

It seems simple but I can't seem to do it. I want to lag the output variable so that each year for a given id always takes the "latest" value of output. As an example, for id=1001, year=1991, output_prev=5 because that was the latest value of output (last updated in year 1990). This continues until we reach year 1995, where output_prev changes to 34 because the latest value of output is 34 (last updated in year 1994).

 

 


Please make sure that your example data step code runs. You are missing an INPUT statement.

For this data you actually want a RETAIN to hold the value across iterations of the data step and know when to write the output record to the data. (BTW naming a variable OUTPUT makes this somewhat confusing as Output is a programming statement as well and could lead to all sorts of miscommunication).

This seems to create your "want" from the example.

DATA have;
input id	year	output
;
DATALINES;
1001	1987	.
1001	1988	.
1001	1989	.
1001	1990	5
1001	1991	.
1001	1992	.
1001	1993	.
1001	1994	34
1001	1995	22
1001	1996	33
1001	1997	15
1001	1998	.
1001	1999	.
1001	2000	.
1001	2001	23
1001	2002	.
1001	2003	45
1001	2004	23
1001	2005	12
1001	2006	.
1001	2007	.
1001	2008	.
1001	2009	.
1001	2010	2
1001	2011	.
1001	2012	.
1001	2013	56
1001	2014	.
1001	2015	.
1001	2016	.
1001	2017	.
1001	2018	23
1001	2019	.
1002	1987	34
1002	1988	.
1002	1989	12
1002	1990	13
1002	1991	55
1002	1992	32
1002	1993	.
1002	1994	.
1002	1995	54
1002	1996	64
1002	1997	.
1002	1998	.
1002	1999	23
1002	2000	.
1002	2001	.
1002	2002	.
1002	2003	.
1002	2004	64
1002	2005	12
1002	2006	.
1002	2007	.
1002	2008	.
1002	2009	.
1002	2010	3
1002	2011	.
1002	2012	.
1002	2013	.
1002	2014	.
1002	2015	.
1002	2016	.
1002	2017	4
1002	2018	.
1002	2019	12
;

/* assumes sorted by ID, if not*/
proc sort data=have;
   by id year;
run;

data want;
   set have;
   retain out_prev;
   by id;
   if first.id then call missing(out_prev);
   output;
   if not missing(output) then out_prev=output;
run;

Retain keeps values of variables across the data step boundary.

Placing the OUTPUT statement to write to the data before assigning the "new" value for output (see what I meant about confusion) means the the record is written with the last value of output_prev.

View solution in original post

1 REPLY 1
ballardw
Super User

@elbarto wrote:

I have a dataset, an excerpt of it is given below:

 

DATA have;
id	year	output
;
DATALINES;
1001	1987	.
1001	1988	.
1001	1989	.
1001	1990	5
1001	1991	.
1001	1992	.
1001	1993	.
1001	1994	34
1001	1995	22
1001	1996	33
1001	1997	15
1001	1998	.
1001	1999	.
1001	2000	.
1001	2001	23
1001	2002	.
1001	2003	45
1001	2004	23
1001	2005	12
1001	2006	.
1001	2007	.
1001	2008	.
1001	2009	.
1001	2010	2
1001	2011	.
1001	2012	.
1001	2013	56
1001	2014	.
1001	2015	.
1001	2016	.
1001	2017	.
1001	2018	23
1001	2019	.
1002	1987	34
1002	1988	.
1002	1989	12
1002	1990	13
1002	1991	55
1002	1992	32
1002	1993	.
1002	1994	.
1002	1995	54
1002	1996	64
1002	1997	.
1002	1998	.
1002	1999	23
1002	2000	.
1002	2001	.
1002	2002	.
1002	2003	.
1002	2004	64
1002	2005	12
1002	2006	.
1002	2007	.
1002	2008	.
1002	2009	.
1002	2010	3
1002	2011	.
1002	2012	.
1002	2013	.
1002	2014	.
1002	2015	.
1002	2016	.
1002	2017	4
1002	2018	.
1002	2019	12
;
RUN;

 

Each id always has year going from 1987 to 2019 (no gaps). I want to "lag" the output variable to produce the following dataset

 

DATA want;
id	year	output	output_prev;
DATALINES;
1001	1987	.	.
1001	1988	.	.
1001	1989	.	.
1001	1990	5	.
1001	1991	.	5
1001	1992	.	5
1001	1993	.	5
1001	1994	34	5
1001	1995	22	34
1001	1996	33	22
1001	1997	15	33
1001	1998	.	15
1001	1999	.	15
1001	2000	.	15
1001	2001	23	15
1001	2002	.	23
1001	2003	45	23
1001	2004	23	45
1001	2005	12	23
1001	2006	.	12
1001	2007	.	12
1001	2008	.	12
1001	2009	.	12
1001	2010	2	12
1001	2011	.	2
1001	2012	.	2
1001	2013	56	2
1001	2014	.	56
1001	2015	.	56
1001	2016	.	56
1001	2017	.	56
1001	2018	23	56
1001	2019	.	23
1002	1987	34	.
1002	1988	.	34
1002	1989	12	34
1002	1990	13	12
1002	1991	55	13
1002	1992	32	55
1002	1993	.	32
1002	1994	.	32
1002	1995	54	32
1002	1996	64	54
1002	1997	.	64
1002	1998	.	64
1002	1999	23	64
1002	2000	.	23
1002	2001	.	23
1002	2002	.	23
1002	2003	.	23
1002	2004	64	23
1002	2005	12	64
1002	2006	.	12
1002	2007	.	12
1002	2008	.	12
1002	2009	.	12
1002	2010	3	12
1002	2011	.	3
1002	2012	.	3
1002	2013	.	3
1002	2014	.	3
1002	2015	.	3
1002	2016	.	3
1002	2017	4	3
1002	2018	.	4
1002	2019	12	4
;
RUN;

It seems simple but I can't seem to do it. I want to lag the output variable so that each year for a given id always takes the "latest" value of output. As an example, for id=1001, year=1991, output_prev=5 because that was the latest value of output (last updated in year 1990). This continues until we reach year 1995, where output_prev changes to 34 because the latest value of output is 34 (last updated in year 1994).

 

 


Please make sure that your example data step code runs. You are missing an INPUT statement.

For this data you actually want a RETAIN to hold the value across iterations of the data step and know when to write the output record to the data. (BTW naming a variable OUTPUT makes this somewhat confusing as Output is a programming statement as well and could lead to all sorts of miscommunication).

This seems to create your "want" from the example.

DATA have;
input id	year	output
;
DATALINES;
1001	1987	.
1001	1988	.
1001	1989	.
1001	1990	5
1001	1991	.
1001	1992	.
1001	1993	.
1001	1994	34
1001	1995	22
1001	1996	33
1001	1997	15
1001	1998	.
1001	1999	.
1001	2000	.
1001	2001	23
1001	2002	.
1001	2003	45
1001	2004	23
1001	2005	12
1001	2006	.
1001	2007	.
1001	2008	.
1001	2009	.
1001	2010	2
1001	2011	.
1001	2012	.
1001	2013	56
1001	2014	.
1001	2015	.
1001	2016	.
1001	2017	.
1001	2018	23
1001	2019	.
1002	1987	34
1002	1988	.
1002	1989	12
1002	1990	13
1002	1991	55
1002	1992	32
1002	1993	.
1002	1994	.
1002	1995	54
1002	1996	64
1002	1997	.
1002	1998	.
1002	1999	23
1002	2000	.
1002	2001	.
1002	2002	.
1002	2003	.
1002	2004	64
1002	2005	12
1002	2006	.
1002	2007	.
1002	2008	.
1002	2009	.
1002	2010	3
1002	2011	.
1002	2012	.
1002	2013	.
1002	2014	.
1002	2015	.
1002	2016	.
1002	2017	4
1002	2018	.
1002	2019	12
;

/* assumes sorted by ID, if not*/
proc sort data=have;
   by id year;
run;

data want;
   set have;
   retain out_prev;
   by id;
   if first.id then call missing(out_prev);
   output;
   if not missing(output) then out_prev=output;
run;

Retain keeps values of variables across the data step boundary.

Placing the OUTPUT statement to write to the data before assigning the "new" value for output (see what I meant about confusion) means the the record is written with the last value of output_prev.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 1 reply
  • 346 views
  • 0 likes
  • 2 in conversation