turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Sum of variables from t-2 to t

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-30-2014 02:02 AM

Dear all,

I would like to compute the total of a variable starting from t-2 to t for a company (gvkey). It seems that proc expand can perform time series total from lag variables.

My data (attached) is presented as follows:

GVKEY FYEAR TXPD

001004 1995 5.3

001004 1996 8.6

001004 1997 6.2

001004 1998 4.4

001004 1999 11.3

001004 2000 3.2

....

I would like to generate a new variable, namely sum_TXPD3. The new variable is computed by totaling the TXPD for 3 years starting from t-2 to t (i.e., t-2, t-1, and t). For example, for the company with the gvkey code 001004, the sum_TXPD3 should be blank for fyear 1995 and 1996 as there is no sufficient data to compute the variable. For the year 1997, the sum_TXPD3 should be 20.1 (5.3 + 8.6 + 6.2).

There might be some missing data in the between of the time series. For example, if the fyear 1996 is missing, then the sum_TXPD3 for fyear 1997 should be stated as ".".

Hope anyone here to help me to resolve the problem. I am not sure proc expand can help in this problem?

Thank you in advanced.

Regards,

mspak

Accepted Solutions

Solution

11-30-2014
07:17 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-30-2014 07:17 PM

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-30-2014 09:08 AM

data want;

set have;

sum_TXPD3=TXPD+lag(TXPD)+lag2(TXPD);

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-30-2014 10:04 AM

I think something like the following will accomplish what you want:

data want;

set test.sampledata;

by gvkey;

if first.gvkey then counter=1;

else counter+1;

sum_TXPD3=ifn(counter gt 2,TXPD+lag(TXPD)+lag2(TXPD),.);

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-30-2014 11:49 AM

An SQL alternative could be (not tested):

proc sql;

create table want as

select *, (select sum(TXPD) from have where gvkey=a.gvkey and fyear <= a.fyear having count(fyear) >=3) as rolling_sum

from have a

;

quit;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-30-2014 02:40 PM

There are a number of similar questions posted in the forums over time. Searching for "rolling" provided links to interesting solutions.

I don't think an sql approach has bein proposed - perhaps because it would be impossible to expand much beyond 10 period lag.

Select a.*, (a.txpd,b.txpd,c.txpd) as sum_txpd3

from sampledata a

left join sampledata b

on a.gvkey EQ b.gvkey

and a.fyear EQ (b.fyear+1)

left join sampledata c

on a.gvkey EQ c.gvkey

and a.fyear EQ (c.fyear+2)

Solution

11-30-2014
07:17 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-30-2014 07:17 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

12-03-2014 09:45 PM

Thank you to all the helpers,

I think Peter's solution is the simplest perhaps. I have another problem is that not all the observations with non-missing fyear+1 and fyear+2. In order to mitigate this problem and to ensure all the gvkey with non-missing years, I perform the following:

proc sql;

create table a as

select *

from (select distinct FYEAR from SAMPLEDATA),

(select distinct GVKEY from SAMPLEDATA) ;

quit;

data want;

merge a SAMPLEDATA;

by gvkey FYEAR;

run;

However, if I wish to exclude missing observations from a given year, how should I do that? For example, if I wish to compute the sum_txpd3 for a given gvkey for year 2013, I must have non-missing txpd for year 2011, 2012 and 2013. If the data for 2012 is missing, then the sum_txpd3 should be left as ".". With the code given by Peter, the the total of txpd for years 2011 to 2013 still will be calculated even if there is one missing txpd, lets say 2012.

Thank you.

Regards,

MSPAK

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

12-03-2014 10:17 PM

Have you actually tried my suggestion? Not bragging here, but it seems to me that my code is :

1) more dynamic and flexible, you can change parameters to fit your different range, such as 10 years, 100 years etc. 2). is capable to deal with the missing value to meet your criteria. See below:

**data** have;

input GVKEY $ FYEAR TXPD;

cards;

001004 1995 5.3

001004 1996 8.6

001004 1997 6.2

001004 1998 4.4

001004 1999 11.3

001004 2000 .

001004 2001 3.2

001004 2002 3.2

001004 2003 3.2

001004 2004 3.2

;

**proc** **sql**;

create table want as

select *, (select sum(TXPD) from have where gvkey=a.gvkey and a.fyear-**2**<= fyear <= a.fyear having count(TXPD) =**3**) as rolling_sum

from have a

;

**quit**;

Only those with non-missing value for 3 continuous years are computed, else are left missing as you have requested.

Haikuo