Re: Complete data history requirement for regression in SAS

Tim_sas_ · Posted 07-18-2020 06:13 AM

Hello,

I have a very basic question regarding a linear regression. I want to regress earnings on 3 lagged variables (please see code below) using the data from 2001-10. The problem is that the history of many firms is not complete (i.e. some firms show missing observations for certain years between 2001 and 2010).

Hence, I need a condition which states that firm histories (years) between the startyear (2001) and the endyear (2010) need to be complete.

I would really appreciate if someone could please help me out how to implement this additional condition.

Many thanks in advance,

Tim

*Run a regression of earnings on lagged variables;
proc reg data=earn ;
	model earn_w = Var1 Var2 Var3;
	where 2001 <= year <= 2010;
* Here I need an additional condition which states that all HouFyears between 2003 and 2010 should exist (complete history requirement);
	ODS OUTPUT parameterestimates = par_est;
run; quit;

PaigeMiller · Posted 07-18-2020 07:35 AM

You'd have to remove the un-desired firms in a DATA step or using PROC SQL before getting to PROC REG. I don't think you can do this inside PROC REG.

--
Paige Miller

Tim_sas_ · Posted 07-18-2020 07:52 AM

Thank you very much for your response, PaigeMiller.

The problem is that I want to automate the regression using a macro/loop function, such that you can choose different time windows for the regression. In this case, it would be a problem to remove the undesired firms in a previous step, because it would depend on the time window I choose or is that a mistake in my chain of thought?

Many thanks,

Tim

FreelanceReinh · Posted 07-18-2020 08:24 AM

Hello @Tim_sas_,

You could create flag variables (with values 0 and 1) for different time windows, e.g., compl_2001_2010=1 for firms with complete data from 2001 through 2010. The definition of the flags depends on your data and criteria. Here is a hypothetical example creating one flag in a view (which is evaluated only during the PROC REG step):

proc sql;
create view earnv as
select * from earn e, 
  (select firm, n(earn_w)=10 as compl_2001_2010 from earn
   where 2001 <= year <= 2010
   group by firm) s
where e.firm=s.firm;
quit;

Then you would use earnv instead of earn in the PROC REG statement and enhance the WHERE statement:

where 2001 <= year <= 2010 & compl_2001_2010;

The values "2001" and "2010" can be replaced by macro variable references both in the inequality and in the variable name.

PaigeMiller · Posted 07-18-2020 09:50 AM

@Tim_sas_ wrote:

Thank you very much for your response, PaigeMiller.

The problem is that I want to automate the regression using a macro/loop function, such that you can choose different time windows for the regression. In this case, it would be a problem to remove the undesired firms in a previous step, because it would depend on the time window I choose or is that a mistake in my chain of thought?

This is not correct. The macro could include a DATA step or PROC SQL that processes the data to have only the desired time window.

--
Paige Miller

mkeintz · Posted 07-18-2020 06:47 PM

By "history of many firms is not complete", do you mean that a particular annual record is completely absent, or might you have a year in which some variables are present, but one or more regression variables is absent?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tim_sas_ · Posted 07-19-2020 12:19 AM

Hello mkeintz,

thank you for your response. Unfortunately, it could be one or the other.

Many thanks,

Tim

mkeintz · Posted 07-20-2020 01:38 PM

So, if your time range is 2001 through 2010, and the dependent variable is EARN_W, and the regression variables are VAR1 VAR2 VAR3, then you want to submit to the regression data for firms with the following data properties:

There is a single record for each year during 2001-2010, meaning there are exactly 10 records where 2001<=YEAR<=2010.
For the variables EARN_W, VAR1, VAR2, and VAR3 there are no missing values in the records satisfying #1 above.

So here's a macro definition the contains the necessary code:

%macro myreg(dsn=,depvars=,regvars=,begyear=,endyear=);
  %let nyears_target = %eval(1+&endyear-&begyear);

  data need  / view=need;
    /* For each firm, read and count qualifying years and relevant missing values */
    do nyear=1 by 1 until (last.firmid);
      set &dsn (keep=firmid year &depvars &regvars);
      by firmid;
      where &begyear <= year <= &endyear;
      total_nmiss=sum(total_nmiss,nmiss(of &depvar &regvars));
    end;
    label nyear = "Number of records for this firm from &begyear through &endyear" ;

    /* Now reread the above records and output for qualifying FIRMID's */
    do until (last.firmid);
      set &dsn (keep=firmid year &depvars &regvars);
      by id;
      where &begyear <= year <= &endyear;
      if nyear= &nyears_target then output;
    end;
  run;

  proc reg data=need;
    model &depvars = &regvars ;
    run;
  quit;
%mend myreg ;

Then you can run the regression for the desired subset of firm via something like:

%myreg(dsn=earn,depvar=earn_w,regavars=var1 var2 var3,begyear=2003,endyear=2010);

The macro parameters are:

dsn Original dataset name.
depvar Name of the dependent variable (or a space separated list of multiple dependent variables).
regvars Space-separated list of the regressor varialbes.
begyear and endyear The lower and upper limits of years in which a complete series is required.

This program assumes your original data set (EARN in the above example) are sorted by FIRMID (but not necessarily sorted by year within FIRMID).

Edit note: I made some corrections to the code (consistent spelling of DEPVARS and REGVARS), but the logic in unchanged.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tim_sas_ · Posted 07-20-2020 02:19 PM

Thank you very much for this helpful response, mkeintz!

I have already used your code and replaced firmid by gvkey and year by houfyear (the original variable declarations). I also sorted the data accordingly by gvkey and replaced "data need / view=need;" by "data earnpred2 / view=earnpred2;". The code is actually working and runs a regression, but unfortunately, it is always the same regression and it does not adjust for the time interval properly (begyear to endyear).

In the results log it shows "ERROR: UPDATE views are not supported."

I have already tried to Google this but it is still not clear to me what went wrong. Do you possibly have an idea?

Many thanks,

Tim

%macro myreg(dsn=,depvar=,regvars=,begyear=,endyear=);
  %let nyears_target = %eval(1+&endyear-&begyear);

  data earnpred2  / view=earnpred2;
    /* For each firm, read and count qualifying years and relevant missing values */
    do nyear=1 by 1 until (last.gvkey);
      set &dsn (keep=gykey houfyear &depvar &regvars);
      by gvkey;
      where &begyear <= year <= &endyear;
      total_nmiss=sum(total_nmiss,nmiss(of &regvars));
    end;
    label nyear = "Number of records for this firm from &begyear through &endyear" ;

    /* Now reread the above records and output for qualifying FIRMID's */
    do until (last.gvkey);
      set &dsn (keep=gvkey year &depvar &regvars);
      by gvkey;
      where &begyear <= year <= &endyear;
      if nyear= &nyears_target then output;
    end;
  run;

  proc reg data=earnpred2;
    model &depvar = &regvars ;
    run;
  quit;
%mend;

%myreg(dsn=earnpred2,depvar=earn_w, regvars=L1_earn_w L1_negE_w L1_NExE_w, begyear=2001, endyear=2010);

mkeintz · Posted 07-20-2020 11:40 PM

What adjustment for time interval do you expect? The code you report shows a macro call asking for 2001 through 2010. If you expect other time intervals, you would need to change the begyear and endyear parameters in the macro call.

And in general, diagnosing this sort of problem is what the log is for. Please show the log, containing the code, as well as the log notes. But precede the invocation of the macro call with the statement OPTIONS MPRINT; which tells sas to print the code implemented by the macro.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FreelanceReinh · Posted 07-21-2020 02:51 AM

@Tim_sas_ wrote:

(...)

I ... replaced "data need / view=need;" by "data earnpred2 / view=earnpred2;". (...)

In the results log it shows "ERROR: UPDATE views are not supported."
  data earnpred2  / view=earnpred2;
  ...
      set &dsn (keep=gykey houfyear &depvar &regvars);
...
      set &dsn (keep=gvkey year &depvar &regvars);
  run;

%myreg(dsn=earnpred2,depvar=earn_w, regvars=L1_earn_w L1_negE_w L1_NExE_w, begyear=2001, endyear=2010);

The change you mentioned led to the error message. The name of the view (defined in the DATA statement) must not be the same as the name of your input dataset in the same library because this would cause a name conflict. For the compiler this looks like a recursive reference to the view. Just use a different name for the view as @mkeintz suggested.

Registration is open

SAS Training: Just a Click Away