Hello,
I have a very basic question regarding a linear regression. I want to regress earnings on 3 lagged variables (please see code below) using the data from 2001-10. The problem is that the history of many firms is not complete (i.e. some firms show missing observations for certain years between 2001 and 2010).
Hence, I need a condition which states that firm histories (years) between the startyear (2001) and the endyear (2010) need to be complete.
I would really appreciate if someone could please help me out how to implement this additional condition.
Many thanks in advance,
Tim
*Run a regression of earnings on lagged variables;
proc reg data=earn ;
model earn_w = Var1 Var2 Var3;
where 2001 <= year <= 2010;
* Here I need an additional condition which states that all HouFyears between 2003 and 2010 should exist (complete history requirement);
ODS OUTPUT parameterestimates = par_est;
run; quit;
You'd have to remove the un-desired firms in a DATA step or using PROC SQL before getting to PROC REG. I don't think you can do this inside PROC REG.
Thank you very much for your response, PaigeMiller.
The problem is that I want to automate the regression using a macro/loop function, such that you can choose different time windows for the regression. In this case, it would be a problem to remove the undesired firms in a previous step, because it would depend on the time window I choose or is that a mistake in my chain of thought?
Many thanks,
Tim
Hello @Tim_sas_,
You could create flag variables (with values 0 and 1) for different time windows, e.g., compl_2001_2010=1 for firms with complete data from 2001 through 2010. The definition of the flags depends on your data and criteria. Here is a hypothetical example creating one flag in a view (which is evaluated only during the PROC REG step):
proc sql;
create view earnv as
select * from earn e,
(select firm, n(earn_w)=10 as compl_2001_2010 from earn
where 2001 <= year <= 2010
group by firm) s
where e.firm=s.firm;
quit;
Then you would use earnv instead of earn in the PROC REG statement and enhance the WHERE statement:
where 2001 <= year <= 2010 & compl_2001_2010;
The values "2001" and "2010" can be replaced by macro variable references both in the inequality and in the variable name.
@Tim_sas_ wrote:
Thank you very much for your response, PaigeMiller.
The problem is that I want to automate the regression using a macro/loop function, such that you can choose different time windows for the regression. In this case, it would be a problem to remove the undesired firms in a previous step, because it would depend on the time window I choose or is that a mistake in my chain of thought?
This is not correct. The macro could include a DATA step or PROC SQL that processes the data to have only the desired time window.
By "history of many firms is not complete", do you mean that a particular annual record is completely absent, or might you have a year in which some variables are present, but one or more regression variables is absent?
Hello mkeintz,
thank you for your response. Unfortunately, it could be one or the other.
Many thanks,
Tim
So, if your time range is 2001 through 2010, and the dependent variable is EARN_W, and the regression variables are VAR1 VAR2 VAR3, then you want to submit to the regression data for firms with the following data properties:
So here's a macro definition the contains the necessary code:
%macro myreg(dsn=,depvars=,regvars=,begyear=,endyear=);
%let nyears_target = %eval(1+&endyear-&begyear);
data need / view=need;
/* For each firm, read and count qualifying years and relevant missing values */
do nyear=1 by 1 until (last.firmid);
set &dsn (keep=firmid year &depvars ®vars);
by firmid;
where &begyear <= year <= &endyear;
total_nmiss=sum(total_nmiss,nmiss(of &depvar ®vars));
end;
label nyear = "Number of records for this firm from &begyear through &endyear" ;
/* Now reread the above records and output for qualifying FIRMID's */
do until (last.firmid);
set &dsn (keep=firmid year &depvars ®vars);
by id;
where &begyear <= year <= &endyear;
if nyear= &nyears_target then output;
end;
run;
proc reg data=need;
model &depvars = ®vars ;
run;
quit;
%mend myreg ;
Then you can run the regression for the desired subset of firm via something like:
%myreg(dsn=earn,depvar=earn_w,regavars=var1 var2 var3,begyear=2003,endyear=2010);
The macro parameters are:
This program assumes your original data set (EARN in the above example) are sorted by FIRMID (but not necessarily sorted by year within FIRMID).
Edit note: I made some corrections to the code (consistent spelling of DEPVARS and REGVARS), but the logic in unchanged.
Thank you very much for this helpful response, mkeintz!
I have already used your code and replaced firmid by gvkey and year by houfyear (the original variable declarations). I also sorted the data accordingly by gvkey and replaced "data need / view=need;" by "data earnpred2 / view=earnpred2;". The code is actually working and runs a regression, but unfortunately, it is always the same regression and it does not adjust for the time interval properly (begyear to endyear).
In the results log it shows "ERROR: UPDATE views are not supported."
I have already tried to Google this but it is still not clear to me what went wrong. Do you possibly have an idea?
Many thanks,
Tim
%macro myreg(dsn=,depvar=,regvars=,begyear=,endyear=); %let nyears_target = %eval(1+&endyear-&begyear); data earnpred2 / view=earnpred2; /* For each firm, read and count qualifying years and relevant missing values */ do nyear=1 by 1 until (last.gvkey); set &dsn (keep=gykey houfyear &depvar ®vars); by gvkey; where &begyear <= year <= &endyear; total_nmiss=sum(total_nmiss,nmiss(of ®vars)); end; label nyear = "Number of records for this firm from &begyear through &endyear" ; /* Now reread the above records and output for qualifying FIRMID's */ do until (last.gvkey); set &dsn (keep=gvkey year &depvar ®vars); by gvkey; where &begyear <= year <= &endyear; if nyear= &nyears_target then output; end; run; proc reg data=earnpred2; model &depvar = ®vars ; run; quit; %mend; %myreg(dsn=earnpred2,depvar=earn_w, regvars=L1_earn_w L1_negE_w L1_NExE_w, begyear=2001, endyear=2010);
What adjustment for time interval do you expect? The code you report shows a macro call asking for 2001 through 2010. If you expect other time intervals, you would need to change the begyear and endyear parameters in the macro call.
And in general, diagnosing this sort of problem is what the log is for. Please show the log, containing the code, as well as the log notes. But precede the invocation of the macro call with the statement OPTIONS MPRINT; which tells sas to print the code implemented by the macro.
@Tim_sas_ wrote:
(...)
I ... replaced "data need / view=need;" by "data earnpred2 / view=earnpred2;". (...)
In the results log it shows "ERROR: UPDATE views are not supported."
data earnpred2 / view=earnpred2; ... set &dsn (keep=gykey houfyear &depvar ®vars); ...
set &dsn (keep=gvkey year &depvar ®vars); run; %myreg(dsn=earnpred2,depvar=earn_w, regvars=L1_earn_w L1_negE_w L1_NExE_w, begyear=2001, endyear=2010);
The change you mentioned led to the error message. The name of the view (defined in the DATA statement) must not be the same as the name of your input dataset in the same library because this would cause a name conflict. For the compiler this looks like a recursive reference to the view. Just use a different name for the view as @mkeintz suggested.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.