06-13-2016 01:10 PM
Hello fellow SAS users!
I need some quick help writing code that produces estimated coefficients from an OLS regression model each year, using only the preceding years' data.
The model is:
DV = var1 + var2
The code is relatively straightforward:
PROC REG DATA=WRDS_OUTPUT PLOTS=NONE OUTEST=PARAM;
  MODEL DV = VAR1 VAR2;
  BY YEAR INDUSTRY;
RUN;
My dilemma is that I need to estimate the constant and coefficients on VAR1 and VAR2 using data through year t-1, which means the sample period will differ each year.
Is there an easy way to incorporate this?
06-13-2016 02:43 PM
Let's assume your WRDS_OUTPUT dataset contains data from 2001 through 2004. You could then create three models per value of the variable INDUSTRY:
1. one for 2002 using the data with YEAR=2001 of the respective INDUSTRY
2. one for 2003 using the data with YEAR in (2001, 2002) of the respective INDUSTRY
3. one for 2004 using the data with YEAR in (2001, 2002, 2003) of the respective INDUSTRY
If this is what you want, I suggest:
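proc sort data=wrds_output;
  by industry;
run;

/* Generate one PROC REG step per target year; each step uses only earlier years. */
data _null_;
  do year=2002 to 2004;
    /* .z<year keeps observations with a missing YEAR out of the estimation window */
    call execute(cats('proc reg data=wrds_output(where=(.z<year<',year,')) plots=none outest=param',year,';'));
    call execute('model dv = var1 var2;');
    call execute('by industry;');
    call execute('run;');
  end;
run;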
The above code creates work datasets PARAM2002, PARAM2003 and PARAM2004 corresponding to items 1 - 3 above. You can concatenate these datasets easily, if needed:
data param;
  length year 8;
  set param2002-param2004 indsname=dsn;
  year=input(compress(dsn,,'kd'), 4.);
run;
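As an alternative to generating one PROC REG step per year, the same expanding-window estimates could be obtained from a single stacked dataset. The following is only a sketch of that different technique, not the approach used above: the names YEARS, EXPANDING, EST_YEAR and PARAM_ALT are made up for illustration, and it assumes WRDS_OUTPUT is small enough that repeating each observation once per later year is acceptable.

```sas
/* Sketch: build an expanding window by joining every target year      */
/* (EST_YEAR, a hypothetical name) to all observations from earlier    */
/* years, then fit all models in one PROC REG step with a BY statement.*/
proc sql;
  create table years as
  select distinct year as est_year
  from wrds_output
  where year > .z;                 /* skip missing values of YEAR */

  create table expanding as
  select y.est_year, w.*
  from years as y, wrds_output as w
  where w.year > .z
    and w.year < y.est_year        /* only data through year t-1 */
  order by y.est_year, w.industry;
quit;

proc reg data=expanding plots=none noprint outest=param_alt;
  by est_year industry;
  model dv = var1 var2;
run;
quit;
```

PARAM_ALT would then hold one row of estimates per EST_YEAR and INDUSTRY, so no concatenation step is needed. The first year in the data has no preceding observations and therefore produces no BY group.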
06-21-2016 12:59 PM
Thanks for the reply. This code worked perfectly and I thank you for it. I have one question, though you may not be able to answer without having access to the data.
First, keep in mind I have already removed observations with missing values from the data set.
Second, my actual data set is from 1957 - 2016; I want to begin regressions as of 1972, so I adjusted the "do fyear=" statement of your code. It still worked.
The problem, though, is that until 1999 the regression doesn't run, saying there are no valid observations for each BY group. I thought this could be a problem of not having enough observations per industry classification, but it is giving the message for ALL industry groups. (Also, it gives the same message if I leave out the BY statement altogether.)
Obviously this could be an issue for me to resolve looking through the data a bit more. But I suppose my question is, does your code inherently require a number of prior years in order to run?
06-21-2016 03:12 PM
This must be a data issue. Even a single observation with non-missing values of DV, VAR1 and VAR2 (in the respective BY group) would prevent the message "ERROR: No valid observations are found." (The results based on a single observation would not be very useful, though.) So, in your case, for many YEARs and apparently all values of INDUSTRY, every observation must have a missing value of DV, VAR1 or VAR2. But if you "have already removed observations with missing values from the data set" (as you wrote), this situation should not occur.
My test data had at least four observations without missing values in each BY group. With three parameters to estimate (the intercept and the coefficients on VAR1 and VAR2), four is the minimum non-degenerate case, because it leaves one degree of freedom for the error.
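One way to check this would be to count, per YEAR and INDUSTRY, how many observations have all three model variables present. The sketch below assumes the variable names from the original post (YEAR, INDUSTRY, DV, VAR1, VAR2, all numeric apart from possibly INDUSTRY); the output table name COMPLETE_COUNTS and the column names N_TOTAL and N_COMPLETE are made up for illustration. If the year variable in the real data has a different name, it would need to be substituted.

```sas
/* Sketch: count complete cases per YEAR and INDUSTRY.               */
/* NMISS(dv, var1, var2) returns how many of the three numeric       */
/* values are missing; the comparison = 0 evaluates to 1 for a       */
/* complete case and 0 otherwise, so SUM() counts complete cases.    */
proc sql;
  create table complete_counts as
  select year,
         industry,
         count(*)                        as n_total,
         sum(nmiss(dv, var1, var2) = 0)  as n_complete
  from wrds_output
  group by year, industry
  order by year, industry;
quit;

/* List the cells that contribute nothing to a regression */
proc print data=complete_counts;
  where n_complete = 0;
run;
```

If N_COMPLETE is zero for every early year, the "no valid observations" message follows directly from the data; if it is positive, the subsetting condition on the year variable would be the next thing to inspect.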
06-21-2016 03:13 PM
This was my thought, too. Just wanted to make sure.
I will give the data a thorough going-over.