- Re: 2-stage Heckman (1979) procedure


Posted 09-07-2014 08:48 PM
(16160 views)

***Apologies for cross-posting***

Hello,

I need some help with Heckman’s (1979) 2-stage procedure using a binary dependent variable. Let’s say I regress Y on several explanatory variables using SAS PROC GENMOD (GEE Logit). Y is a binary variable, and refers to the decision to complete or abandon an acquisition (completed=1, abandon=0).

    PROC GENMOD DATA=my_filename descending;
        CLASS case_number company_id;
        model completed = X1*X2 X1 X2 X3 X4
              Year1993 Year1994 Year1995 Year1996 Year1997
              / DIST=bin LINK=logit;
        REPEATED subject=company_id / TYPE=ar(1);
    run;

The decision to complete or abandon an acquisition may be endogenous, in that firms may decide to complete the acquisition based on expectations of economic uncertainty, which cannot be measured and cannot be included in the regression equation. Hence, I need to control for sample selection bias (endogeneity) using the 2-stage Heckman (1979) procedure. Sartori (2003) recommends using the exclusion restriction procedure, in which an additional meaningful variable is added to the first-stage selection equation but not to the second-stage equation. In line with this, in the first-stage probit model, I included a categorical variable (CatVar) to satisfy the exclusion restriction requirement. For stage 1, I calculated a high acquisition experience dummy variable (1 if the number of previous acquisitions over the past 5 years is greater than the average, 0 otherwise), which served as the dependent variable in the probit model. Since I get the error message "Error in computing the variance function" when I run Stage 1, I removed explanatory variable X3, which appears to be highly collinear with the dependent variable (HighAcq_dum).

    /* Stage 1 Heckman test - Probit specification with self-selection */
    PROC GENMOD DATA=my_filename descending;
        CLASS case_number company_id;
        model HighAcq_dum = X1 X2 X4 CatVar
              Year1993 Year1994 Year1995 Year1996 Year1997
              / DIST=bin LINK=probit;
        REPEATED subject=company_id / TYPE=ar(1);
        output out=my_filename.heck prob=pcompleted;
    run;

    /* calculate inverse mills ratio */
    data my_filename.heck;
        set my_filename.heck;
        IMR = pdf('NORMAL', pcompleted) / cdf('NORMAL', pcompleted);
    run;

    /* Stage 2 Heckman test - Include IMR into GEE logit outcome equation */
    PROC GENMOD DATA=my_filename.heck descending;
        CLASS case_number company_id;
        model completed = X1*X2 X1 X2 X3 X4 IMR
              Year1993 Year1994 Year1995 Year1996 Year1997
              / DIST=bin LINK=logit;
        REPEATED subject=company_id / TYPE=ar(1);
    run;

(1) Would you please let me know if the above SAS code for the 2-stage Heckman procedure is correct? If not, what needs to be changed?

Thanks for your help!

Best,

Elizabeth

1 ACCEPTED SOLUTION


Hi Elizabeth,

I have a few comments about your second post.

You are correct that if you are using the HECKIT option of PROC QLIM then the second stage dependent variable has to be continuous in nature. However, you can still consistently estimate your model using PROC QLIM even if you have a binary dependent variable for that model. The SAS code below estimates your selection model consistently:

    PROC QLIM DATA=test;
        /* the selection equation--probit */
        MODEL HighAcq_dum = X1*X2 X1 X2 X3 X4 CatVar
              Year1993 Year1994 Year1995 Year1996 Year1997 / DISCRETE;
        /* the equation of interest */
        MODEL completed = X1*X2 X1 X2 X3 X4
              Year1993 Year1994 Year1995 Year1996 Year1997
              / SELECT(HighAcq_dum=1) DISCRETE(DIST=LOGISTIC);
    RUN;

Note that the HECKIT option is not on. This way, the two models are estimated simultaneously and the endogeneity problem that occurs due to the selected sample is taken into account. This is a one-step method and, if your model is correct, it’s more efficient than its two-step counterparts.

Now, about your first question: I am not sure that your two-step method would produce consistent estimates for the model you are interested in. In the first step you estimate the probit model to calculate the inverse Mills ratio, and you then use it to correct for the bias in the second-stage logit model. However, note that Heckman, in his 1979 article, derives that bias correction, namely the inverse Mills ratio, for a linear model of interest, i.e., a model with a continuous dependent variable, under some particular distributional assumptions. In other words, the nature of the bias may depend on the nature of the dependent variable and the distributional assumptions of the model of interest. If so, including the inverse Mills ratio does not correct for that bias, because the bias may take a different form. The two-step method you explained above may therefore give you inconsistent estimates.
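As a side note on the correction term itself: in Heckman's linear setting the inverse Mills ratio is evaluated at the probit linear index, not at the fitted probability. The sketch below is only an illustration under stated assumptions: the data set name `imr_demo`, the variable names, and the value 0.8 are all hypothetical and not taken from the data in this thread. It prints both quantities so the difference is visible. In PROC GENMOD, the index can also be requested directly with the XBETA= keyword of the OUTPUT statement.

```sas
/* Sketch only: p = 0.8 is an arbitrary illustrative value */
data imr_demo;
   p  = 0.8;          /* fitted selection probability from a probit model */
   xb = probit(p);    /* probit linear index, since p = cdf('NORMAL', xb) */

   /* inverse Mills ratio evaluated at the linear index */
   imr_from_index = pdf('NORMAL', xb) / cdf('NORMAL', xb);

   /* same formula applied to the probability, for comparison */
   imr_from_prob  = pdf('NORMAL', p) / cdf('NORMAL', p);

   put xb= imr_from_index= imr_from_prob=;   /* write values to the log */
run;
```

The two printed values differ, so which argument is passed to the PDF and CDF functions matters even in the linear case that Heckman's derivation covers.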

9 REPLIES 9


Hello -

Do you have access to SAS/ETS software? If yes, then you may want to look at PROC QLIM instead.

In fact, when using SAS Studio and SAS/ETS combined, you will find a custom task for Heckman Selection Models - see: http://support.sas.com/documentation/cdl/en/webeditorug/67434/HTML/default/viewer.htm#n1918qt6877sgb...

which will create some SAS code for you as a starting point - or you can check out: http://support.sas.com/documentation/cdl/en/etsug/67525/HTML/default/viewer.htm#etsug_qlim_examples0...

More details can be found here: http://support.sas.com/documentation/cdl/en/etsug/67525/HTML/default/viewer.htm#etsug_qlim_details17... - might want to chime in with more details.

Thanks,

Udo


Hi Udo,

Thank you for the weblinks. I am familiar with these materials and have read them. While helpful, these materials deal with selection issues related to continuous dependent variables. For instance, in the sample selection model using the Mroz dataset, the second stage dependent variable is "lwage" which is continuous in nature. I'm looking for guidance on Heckman SAS codes where the dependent variable in the second stage is a binary variable.

I have two follow-up questions:

(1) I'm interested in using PROC GENMOD to run the Heckman procedure. Is the following SAS code correct given that I have a binary dependent variable (completed acquisition=1, abandon acquisition=0) in the outcome equation of interest?

    /* Stage 1 Heckman test - Probit specification with self-selection */
    PROC GENMOD DATA=my_filename descending;
        CLASS case_number company_id;
        model HighAcq_dum = X1*X2 X1 X2 X3 X4 CatVar
              Year1993 Year1994 Year1995 Year1996 Year1997
              / DIST=bin LINK=probit;
        REPEATED subject=company_id / TYPE=ar(1);
        output out=my_filename.heck prob=pcompleted;
    run;

    /* calculate inverse mills ratio */
    data my_filename.heck;
        set my_filename.heck;
        IMR = pdf('NORMAL', pcompleted) / cdf('NORMAL', pcompleted);
    run;

    /* Stage 2 Heckman test - Include IMR into GEE logit outcome equation */
    PROC GENMOD DATA=my_filename.heck descending;
        CLASS case_number company_id;
        model completed = X1*X2 X1 X2 X3 X4 IMR
              Year1993 Year1994 Year1995 Year1996 Year1997
              / DIST=bin LINK=logit;
        REPEATED subject=company_id / TYPE=ar(1);
    run;

(2) Let's say I want to run the Heckman using PROC QLIM instead. Are the SAS codes below correct? If not, what needs to be changed?

    proc qlim data=test heckit;
        /* the selection equation--probit */
        model HighAcq_dum = X1*X2 X1 X2 X3 X4 CatVar
              Year1993 Year1994 Year1995 Year1996 Year1997
              / discrete;
        /* the equation of interest */
        model completed = X1*X2 X1 X2 X3 X4
              Year1993 Year1994 Year1995 Year1996 Year1997
              / select(HighAcq_dum=1);
    run;

Thanks so much for your help!

Best,

Elizabeth



Hello gunce@sas,

Thank you so much for your helpful response and the codes! I really appreciate this. I have a few follow-up questions.

(1) If I use your recommended SAS codes to run, I get the error message "ERROR: No valid model," and I don't see any parameter estimates. If I remove "(DIST=LOGISTIC)" in the second stage equation, or if I use "DISCRETE(DIST=probit)," I see a display of parameter estimates but I also see the following error messages:

WARNING: The Hessian matrix is singular.

WARNING: The Hessian matrix is singular.

ERROR: QUANEW Optimization cannot be completed.

NOTE: QUANEW needs more than 200 iterations or 1000 function calls.

WARNING: This is an experimental release of the PLOTS option.

It seems to me the program does not like having a logit specification in the second stage. How do I modify the codes so that I can get the program to run properly while retaining the logit model? [Note that my research question suggests that I should run a logit in my equation of interest.]

(2) In the second instance when I was able to get the program to produce parameter estimates despite having error messages, I see that “_Rho” is insignificant. What is “_Rho”? Is it some kind of inverse mills ratio? Does an insignificant _Rho mean that I don’t have a selection bias problem?

(3) To my understanding when the HECKIT option is specified, PROC QLIM automatically reports the corrected standard errors. Are your recommended codes (without HECKIT option) producing corrected or uncorrected standard errors? If the latter, how do I modify the codes to correct for standard errors?

(4) Do your codes account for heteroscedasticity as well? If not, how do I account for heteroscedasticity?

Thank you in advance for your guidance!

Best,

Elizabeth


Hi Elizabeth,

I overlooked the fact that one cannot have multiple equations when the (DIST=LOGISTIC) option is specified for a model. Since you have two equations, you cannot use that specification. Nevertheless, using a probit model instead of a logit shouldn’t change the results much, since the two models are very similar. The warning about the Hessian being singular can be due to collinearity or a general identification problem. Without seeing the data set I cannot say much about this problem.
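For concreteness, the probit version of the earlier code might look like the sketch below. It only swaps `DISCRETE(DIST=LOGISTIC)` for plain `DISCRETE` in the equation of interest and keeps the variable and data set names already used in this thread. Elizabeth reported that this form produces parameter estimates on her data, although with singular-Hessian warnings, so it is a starting point rather than a verified fix.

```sas
/* Sketch: same specification as before, with a probit outcome equation */
PROC QLIM DATA=test;
   /* the selection equation--probit */
   MODEL HighAcq_dum = X1*X2 X1 X2 X3 X4 CatVar
         Year1993 Year1994 Year1995 Year1996 Year1997 / DISCRETE;
   /* the equation of interest--probit instead of logit */
   MODEL completed = X1*X2 X1 X2 X3 X4
         Year1993 Year1994 Year1995 Year1996 Year1997
         / SELECT(HighAcq_dum=1) DISCRETE;
RUN;
```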

The _Rho parameter is important. It is the correlation coefficient between the errors of the two models. It tells you if you actually have the selection bias in your sample or not. An insignificant _Rho usually implies that you don’t have a selection bias problem in your model of interest or it can imply that your choice of model is not correct.
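One common way to write a selection model with a probit in both equations is sketched below. The notation ($w_i$, $x_i$, $\gamma$, $\beta$, $u_i$, $\varepsilon_i$) is an assumption introduced for illustration and does not come from the thread; $\rho$ plays the role of the `_Rho` parameter.

```latex
\begin{aligned}
s_i^* &= w_i'\gamma + u_i, & s_i &= \mathbf{1}[s_i^* > 0] \\
y_i^* &= x_i'\beta + \varepsilon_i, & y_i &= \mathbf{1}[y_i^* > 0] \quad \text{(observed only if } s_i = 1\text{)} \\
\begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix}
  &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)
\end{aligned}
```

Under this joint normality assumption, $\rho = 0$ means the two errors are independent, so being selected carries no information about the outcome error.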

Standard error correction is necessary when one is using a two-step procedure. If you don’t specify the HECKIT option, then the estimation is done in one step and in that case no correction is needed.

You can account for heteroscedasticity using the HETERO statement.
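A minimal sketch of what that could look like follows, under stated assumptions. The HETERO statement takes the form `HETERO dependent-variable ~ variance-covariates;`. Using X4 as the variable that drives the error variance is purely illustrative and not something suggested in the thread. Whether HETERO is accepted together with SELECT and DISCRETE in a given SAS/ETS release should be checked against the PROC QLIM documentation.

```sas
/* Sketch: X4 as the variance covariate is a hypothetical choice */
PROC QLIM DATA=test;
   /* the selection equation--probit */
   MODEL HighAcq_dum = X1*X2 X1 X2 X3 X4 CatVar
         Year1993 Year1994 Year1995 Year1996 Year1997 / DISCRETE;
   /* the equation of interest--probit */
   MODEL completed = X1*X2 X1 X2 X3 X4
         Year1993 Year1994 Year1995 Year1996 Year1997
         / SELECT(HighAcq_dum=1) DISCRETE;
   /* let the error variance of the outcome equation depend on X4 */
   HETERO completed ~ X4;
RUN;
```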

I hope these help.

Best,

Gunce


Hi gunce @sas,

Thank you once again for a very helpful response! I really appreciate that you graciously took time to help out someone you don't know. I think I now have sufficient info to proceed with my analysis. I will come back if I've got more questions.

Regards,

Elizabeth


I am trying to do something similar. I have a continuous dependent variable and a binary independent variable, TREATED (0/1). I want to determine whether I have unmeasured bias.

I first create a PROBIT model and output the estimated probabilities (prob) of being treated.

Next, I calculate the Inverse Mills Ratio:

IMR = pdf('NORMAL', prob ) / cdf('NORMAL', prob ); /*inverse mills ratio*/

Then run my GLM:

    proc glm data=weighted_PS;
        class RHS;
        model LHS = RHS IMR / ss3 solution;
        weight weights;
    run;

Is this correct?


Hello -

I don't have an answer for you, but I would suggest opening a new discussion rather than attaching your question to this existing one (see also: https://communities.sas.com/docs/DOC-2263). This will increase visibility.

Thanks,

Udo


It looks correct, but without knowing the exact details of your models I can't be sure.

Actually, what you are trying can be achieved by estimating a selection model in PROC QLIM with the HECKIT option on. You need to have two MODEL statements, one specifying the first model that you estimated (using the DISCRETE option) and the other one specifying the second model (the model with the continuous dependent variable) using the SELECT option.
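A rough sketch of that setup is shown below. `Z1` and `Z2` are hypothetical covariates for the first (probit) model, since the original post does not list them, and `RHS` is treated as a numeric regressor here. The CLASS and WEIGHT statements from the PROC GLM step are left out and would need to be checked against the PROC QLIM documentation.

```sas
/* Sketch: Z1 and Z2 are hypothetical; only LHS, RHS, TREATED and the
   data set name come from the post above */
PROC QLIM DATA=weighted_PS HECKIT;
   /* first model: probit for being treated */
   MODEL TREATED = Z1 Z2 / DISCRETE;
   /* second model: continuous outcome, observed where TREATED=1 */
   MODEL LHS = RHS / SELECT(TREATED=1);
RUN;
```

With HECKIT on, the inverse Mills ratio is computed inside the procedure, so it does not have to be built by hand in a DATA step.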
