***Apologies for cross-posting***
Hello,
I need some help with Heckman’s (1979) 2-stage procedure using a binary dependent variable. Let’s say I regress Y on several explanatory variables using SAS PROC GENMOD (GEE Logit). Y is a binary variable, and refers to the decision to complete or abandon an acquisition (completed=1, abandon=0).
PROC GENMOD DATA=my_filename descending;
CLASS case_number company_id;
model completed = X1*X2 X1 X2 X3 X4
Year1993 Year1994 Year1995 Year1996 Year1997
/DIST=bin LINK=logit;
REPEATED subject=company_id /TYPE=ar(1) ;
run;
The decision to complete or abandon an acquisition may be endogenous, in that firms may decide whether to complete the acquisition based on expectations of economic uncertainty, which cannot be measured and so cannot be included in the regression equation. Hence, I need to control for sample selection bias (endogeneity) using the 2-stage Heckman (1979) procedure. Sartori (2003) recommends using an exclusion restriction, in which an additional meaningful variable is added to the first-stage selection equation but not to the second-stage equation. In line with this, I included a categorical variable (CatVar) in the first-stage probit model to satisfy the exclusion restriction. For stage 1, I calculated a high-acquisition-experience dummy variable (1 if the number of previous acquisitions over the past 5 years is greater than the average, 0 otherwise), which served as the dependent variable in the probit model. Because I got the error message "Error in computing the variance function" when I ran Stage 1, I removed the explanatory variable X3, which appears to be highly collinear with the dependent variable (HighAcq_dum).
/* Stage 1 Heckman test - Probit specification with self-selection */
PROC GENMOD DATA=my_filename descending;
CLASS case_number company_id;
model HighAcq_dum = X1 X2 X4 CatVar
Year1993 Year1994 Year1995 Year1996 Year1997
/DIST=bin LINK=probit;
REPEATED subject=company_id /TYPE=ar(1) ;
output out=my_filename.heck prob=pcompleted ;
run;
/*calculate inverse mills ratio */
data my_filename.heck ;
set my_filename.heck ;
IMR = pdf('NORMAL', pcompleted ) / cdf('NORMAL', pcompleted ); run;
/*Stage 2 Heckman test - Include IMR into GEE logit outcome equation*/
PROC GENMOD DATA= my_filename.heck descending;
CLASS case_number company_id;
model completed = X1*X2 X1 X2 X3 X4 IMR
Year1993 Year1994 Year1995 Year1996 Year1997
/DIST=bin LINK=logit;
REPEATED subject=company_id /TYPE=ar(1) ; run;
(1) Would you please let me know if the above SAS code for the 2-stage Heckman procedure is correct? If not, what needs to be changed?
Thanks for your help!
Best,
Elizabeth
Hello -
Do you have access to SAS/ETS software? If yes, then you may want to look at PROC QLIM instead.
In fact, when using SAS Studio and SAS/ETS combined, you will find a custom task for Heckman Selection Models - see: http://support.sas.com/documentation/cdl/en/webeditorug/67434/HTML/default/viewer.htm#n1918qt6877sgb...
which will create some SAS code for you as a starting point - or you can check out: http://support.sas.com/documentation/cdl/en/etsug/67525/HTML/default/viewer.htm#etsug_qlim_examples0...
More details can be found here: http://support.sas.com/documentation/cdl/en/etsug/67525/HTML/default/viewer.htm#etsug_qlim_details17... - might want to chime in with more details.
Thanks,
Udo
Hi Udo,
Thank you for the weblinks. I am familiar with these materials and have read them. While helpful, these materials deal with selection issues related to continuous dependent variables. For instance, in the sample selection model using the Mroz dataset, the second stage dependent variable is "lwage" which is continuous in nature. I'm looking for guidance on Heckman SAS codes where the dependent variable in the second stage is a binary variable.
I have two follow-up questions:
(1) I'm interested in using PROC GENMOD to run the Heckman procedure. Is the following SAS code correct given that I have a binary dependent variable (completed acquisition=1, abandoned acquisition=0) in the outcome equation of interest?
/* Stage 1 Heckman test - Probit specification with self-selection */
PROC GENMOD DATA=my_filename descending;
CLASS case_number company_id;
model HighAcq_dum =X1*X2 X1 X2 X3 X4 CatVar
Year1993 Year1994 Year1995 Year1996 Year1997
/DIST=bin LINK=probit;
REPEATED subject=company_id /TYPE=ar(1) ;
output out=my_filename.heck prob=pcompleted ;
run;
/*calculate inverse mills ratio */
data my_filename.heck ;
set my_filename.heck ;
IMR = pdf('NORMAL', pcompleted ) / cdf('NORMAL', pcompleted ); run;
/*Stage 2 Heckman test - Include IMR into GEE logit outcome equation*/
PROC GENMOD DATA= my_filename.heck descending;
CLASS case_number company_id;
model completed = X1*X2 X1 X2 X3 X4 IMR
Year1993 Year1994 Year1995 Year1996 Year1997
/DIST=bin LINK=logit;
REPEATED subject=company_id /TYPE=ar(1) ; run;
(2) Let's say I want to run the Heckman using PROC QLIM instead. Are the SAS codes below correct? If not, what needs to be changed?
proc qlim data=test heckit;
model HighAcq_dum = X1*X2 X1 X2 X3 X4 CatVar
Year1993 Year1994 Year1995 Year1996 Year1997
/ discrete;/* the selection equation--probit */
model completed = X1*X2 X1 X2 X3 X4
Year1993 Year1994 Year1995 Year1996 Year1997
/ select(HighAcq_dum=1); /* the equation of interest */
run;
Thanks so much for your help!
Best,
Elizabeth
Hi Elizabeth,
I have a few comments about your second post.
You are correct that if you are using the HECKIT option of
PROC QLIM then the second stage dependent variable has to be continuous in
nature. However, you can still consistently estimate your model using
PROC QLIM even if you have a binary dependent variable for that model. The SAS
code below estimates your selection model consistently:
PROC QLIM DATA=test ;
/* the selection equation--probit */
MODEL HighAcq_dum = X1*X2 X1 X2 X3 X4 CatVar
Year1993 Year1994 Year1995 Year1996 Year1997 / DISCRETE;
/* the equation of interest */
MODEL completed = X1*X2 X1 X2 X3 X4 Year1993 Year1994 Year1995
Year1996 Year1997 / SELECT(HighAcq_dum=1) DISCRETE(DIST=LOGISTIC);
RUN;
Note that the HECKIT option is not on. This way, the two models
are estimated simultaneously and the endogeneity problem that arises from the
selected sample is taken into account. This is a one-step method and, if your
model is correct, it is more efficient than its two-step counterparts.
Now, about your first question: I am not sure that your
two-step method would produce consistent estimates for the model you are
interested in. In the first step you estimate the probit
model to calculate the inverse Mills ratio, and you then use it to correct for
the bias in your second-stage logit model. Note, however, that Heckman, in his
1979 article, derives that bias correction, namely the inverse Mills ratio, for
a linear model of interest, i.e., a model with a continuous dependent variable,
under particular distributional assumptions. In other words, the nature of
the bias may depend on the nature of the dependent variable and the
distributional assumptions of the model of interest. If so, then including the
inverse Mills ratio does not correct for that bias, because the bias may
take a different form. The two-step method you described above may therefore
give you inconsistent estimates.
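For reference, the result that the two-step correction rests on can be written out. This is only a sketch in standard textbook notation; the symbols ($w_i$, $\gamma$, $u_i$, $\varepsilon_i$, $\rho$, $\sigma_\varepsilon$, $\lambda$) are introduced here for illustration and do not appear elsewhere in the thread.

```latex
\begin{align*}
z_i^{*} &= w_i'\gamma + u_i, \qquad z_i = \mathbf{1}[\,z_i^{*} > 0\,]
  && \text{(selection equation, probit)} \\
y_i &= x_i'\beta + \varepsilon_i \quad \text{observed only if } z_i = 1
  && \text{(linear equation of interest)} \\
(u_i, \varepsilon_i) &\sim \text{bivariate normal}, \quad
  \operatorname{Var}(u_i) = 1,\ \operatorname{Var}(\varepsilon_i) = \sigma_\varepsilon^{2},\
  \operatorname{Corr}(u_i, \varepsilon_i) = \rho \\
E[\,y_i \mid x_i, w_i, z_i = 1\,] &= x_i'\beta + \rho\,\sigma_\varepsilon\,\lambda(w_i'\gamma),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{align*}
```

The additive term $\rho\,\sigma_\varepsilon\,\lambda(\cdot)$ comes from taking the conditional mean of a linear equation. With a logit equation of interest the conditional probability is not, in general, a logit with $\lambda$ added to the index, which is the concern raised above. Note also that $\lambda$ is evaluated at the probit index $w_i'\gamma$, not at the fitted probability $\Phi(w_i'\gamma)$.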
Hello gunce@sas,
Thank you so much for your helpful response and the codes! I really appreciate this. I have a few follow-up questions.
(1) If I use your recommended SAS codes to run, I get the error message "ERROR: No valid model," and I don't see any parameter estimates. If I remove "(DIST=LOGISTIC)" in the second stage equation, or if I use "DISCRETE(DIST=probit)," I see a display of parameter estimates but I also see the following error messages:
WARNING: The Hessian matrix is singular.
WARNING: The Hessian matrix is singular.
ERROR: QUANEW Optimization cannot be completed.
NOTE: QUANEW needs more than 200 iterations or 1000 function calls.
WARNING: This is an experimental release of the PLOTS option.
It seems to me the program does not like having a logit specification in the second stage. How do I modify the codes so that I can get the program to run properly while retaining the logit model? [Note that my research question suggests that I should run a logit in my equation of interest.]
(2) In the second instance when I was able to get the program to produce parameter estimates despite having error messages, I see that “_Rho” is insignificant. What is “_Rho”? Is it some kind of inverse mills ratio? Does an insignificant _Rho mean that I don’t have a selection bias problem?
(3) To my understanding when the HECKIT option is specified, PROC QLIM automatically reports the corrected standard errors. Are your recommended codes (without HECKIT option) producing corrected or uncorrected standard errors? If the latter, how do I modify the codes to correct for standard errors?
(4) Do your codes account for heteroscedasticity as well? If not, how do I account for heteroscedasticity?
Thank you in advance for your guidance!
Best,
Elizabeth
Hi Elizabeth,
I overlooked the fact that one cannot have multiple equations when the (DIST=LOGISTIC) option is specified for a model. Since you have two equations, you cannot use that specification. Nevertheless, using a probit model instead of a logit shouldn’t change the results much; the two are very similar models. The warning about the Hessian being singular can be due to collinearity or a general identification problem. Without seeing the data set I cannot say much about this problem.
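As a sketch of the probit-probit specification described here, the earlier PROC QLIM step can be written with a plain DISCRETE option on both MODEL statements (the normal distribution, i.e. probit, is the default for DISCRETE). The data set name test and all variable names are copied from the earlier posts in this thread; whether the optimization converges will depend on the data.

```sas
proc qlim data=test;
   /* selection equation: probit (default distribution for DISCRETE) */
   model HighAcq_dum = X1*X2 X1 X2 X3 X4 CatVar
                       Year1993 Year1994 Year1995 Year1996 Year1997 / discrete;
   /* equation of interest: probit instead of logit, estimated jointly */
   model completed = X1*X2 X1 X2 X3 X4
                     Year1993 Year1994 Year1995 Year1996 Year1997
                     / select(HighAcq_dum=1) discrete;
run;
```

If the singular-Hessian warning persists, one thing that might be worth trying, given the collinearity between X3 and HighAcq_dum reported in the original post, is dropping X3 from the selection equation. That is a guess rather than a diagnosis.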
The _Rho parameter is important. It is the correlation coefficient between the errors of the two models. It tells you if you actually have the selection bias in your sample or not. An insignificant _Rho usually implies that you don’t have a selection bias problem in your model of interest or it can imply that your choice of model is not correct.
Standard error correction is necessary when one is using a two-step procedure. If you don’t specify the HECKIT option, then the estimation is done in one step and in that case no correction is needed.
You can account for heteroscedasticity using the HETERO statement.
I hope these help.
Best,
Gunce
Hi gunce @sas,
Thank you once again for a very helpful response! I really appreciate that you graciously took the time to help out someone you don't know. I think I now have sufficient info to proceed with my analysis. I will come back if I've got more questions.
Regards,
Elizabeth
I am trying to do something similar. I have a continuous dep variable and binary ind variable which is TREATED (0/1). I want to determine if I have unmeasured bias.
I first create a PROBIT model and output the estimated probabilities (prob) of being treated.
Next, I calculate the Inverse Mills Ratio:
IMR = pdf('NORMAL', prob ) / cdf('NORMAL', prob ); /*inverse mills ratio*/
Then run my GLM:
proc glm data = weighted_PS;
class RHS;
model LHS = RHS IMR/ ss3 solution;
weight weights;
run;
Is this correct?
Hello -
I don't have an answer for you, but I would suggest opening a new discussion rather than attaching your question to this existing one.
(see also: https://communities.sas.com/docs/DOC-2263). This will increase visibility.
Thanks,
Udo
It looks correct but without knowing exact details of your models I can't be so sure.
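One detail that may be worth checking in the two-step version: the inverse Mills ratio is conventionally evaluated at the probit linear predictor rather than at the fitted probability itself. A minimal sketch, assuming the probit step wrote the fitted probability to a variable named prob in a data set called probit_out (both data set names below are hypothetical), and that prob lies strictly between 0 and 1:

```sas
data with_imr;                        /* hypothetical output data set */
   set probit_out;                    /* hypothetical input holding prob from the probit step */
   z   = probit(prob);                /* standard normal quantile: recovers the probit index */
   IMR = pdf('NORMAL', z) / cdf('NORMAL', z);   /* cdf('NORMAL', z) equals prob */
run;
```

If the probit is fitted with PROC GENMOD, the linear predictor can instead be requested directly with the XBETA= keyword on the OUTPUT statement, which avoids the inversion step.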
Actually, what you are trying can be achieved by estimating a selection model in PROC QLIM with the HECKIT option on. You need to have two MODEL statements, one specifying the first model that you estimated (using the DISCRETE option) and the other one specifying the second model (the model with the continuous dependent variable) using the SELECT option.
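A minimal sketch of that setup follows. The data set name weighted_PS and the variables TREATED and LHS come from the question; Z1, Z2, X1 and X2 are placeholder regressors, not names from the thread, and the weights used in the PROC GLM step are not carried over here.

```sas
proc qlim data=weighted_PS heckit;
   /* first model: probit for the binary indicator (Z1, Z2 are placeholders) */
   model TREATED = Z1 Z2 / discrete;
   /* second model: continuous outcome, linked to the first by SELECT (X1, X2 are placeholders) */
   model LHS = X1 X2 / select(TREATED=1);
run;
```

As discussed earlier in the thread, the HECKIT option requests the two-step estimator with corrected standard errors; leaving it off estimates both equations in one step.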