Logistic Regression with Instrumental Variable

Posted 06-16-2014 11:51 PM
(7945 views)

Hello;

I am trying to regress a *ratio* variable, Y, on an independent variable, X.

Variable X is endogenous, so I have to use an instrumental variable, Z.

Both X and Z are *continuous* variables.

How can I run this model in SAS? I know PROC SYSLIN handles this easily when the dependent variable is continuous, but what if the dependent variable is a ratio or dichotomous?

PROC QLIM only allows the left-hand-side variable to be a binary outcome and endogenous. When the right-hand-side variable is endogenous, QLIM is not useful.

If there is no built in procedure, is there any alternative way to run this model in SAS?

Thanks for your help in advance.

Can you be a bit more explicit in describing the ratio variable Y? Is it bounded (above, below, or both)? Is it a proportion or pseudo-proportion (bounded below by 0 and above by 1)?

I think the best you might do with PROC QLIM is to just treat Y as a continuous variable, with censoring/cutoff (I could be wrong about this). If that is unsatisfactory, you may wish to consider a generalized linear (mixed) modeling procedure, such as GENMOD or GLIMMIX, with an appropriate distributional assumption.

Steve Denham

Dear Steven;

Y represents the number of successes divided by the total number of trials in a sample. I tried QLIM with a bounded dependent variable (0<Y<1), but I am not sure why the estimation results differ from the PROC LOGISTIC results (before adjusting for the endogenous variable, that is, modeling Y as a function of X only and ignoring endogeneity).

Hi Niam,

Actually this was true prior to the latest release of ETS. As of the 13.1 release QLIM supports a number of models with RHS endogenous regressors, including logit and probit models.

Please see this documentation and let me know if you would like any help using it.

Ken

Dear ets_kps;

Thank you very much for your helpful answer.

I tried to use the QLIM procedure. Since Y is a proportion variable (the ratio of the number of successes to the total number of trials), I first tried the following code:

Proc Qlim;

model Y=X /discrete(d=logit);

model X=Z;

run;

but I get the following error message:

"There are only 51 non missing observation in the input data set. It is too small to estimate parameters."

Then I tried to define Y as a censored dependent variable and used the following code:

Proc Qlim;

model Y=X/censored(lb=0 ub=1);

model X=Z;

run;

This model runs and gives me some results; however, they are not consistent with an alternative approach that I implemented. As I told Steve, even the simple regression (ignoring the second model for the endogenous variable) produces results that differ from those of PROC LOGISTIC. This is the alternative method:

Proc reg;

model X=Z;

output out=temp pred=Xhat;

run;

proc logistic data=temp;

model Y=Xhat;

run;

This is basically a variation of the 2SLS method: I get the predicted values of the endogenous variable from the first-stage OLS regression and then use them in the second-stage logistic regression.

Why do you think the two methods produce different results? Am I using PROC QLIM correctly? Is the censored option the best way to handle proportion variables in QLIM? Or is there a problem with the alternative method?

Thank you very much for your help;

Best

Dear Steve and ets_kps;

If I use a binary dependent variable instead of a proportion variable, then PROC QLIM (with the **/discrete(d=logit)** option) and PROC LOGISTIC produce identical results. So the problem is how to model proportion variables with QLIM, since the **/censored(lb=0 ub=1)** option in PROC QLIM does not produce results similar to those from PROC LOGISTIC.

While you can do the manual two-stage regression, doesn't it run into the problem that the amount of data (or complete cases, not sure which) that PROC QLIM complains about will result in unstable estimates with unacceptably large standard errors?

I don't think the censored option is a good way to proceed at this point. That leaves the manual 2SLS, or maybe a Bayesian approach in QLIM.

Steve Denham

I think that when the dependent variable is a fractional response variable, it should be modeled as truncated rather than censored. With censoring at lower bound 0 and upper bound 1, you are saying that observations that are negative or bigger than 1 actually exist but you are not able to observe them in your sample. With truncation at lower bound 0 and upper bound 1, you are saying that the support of the distribution is [0, 1] and observations can't exist beyond these boundaries. Hence,

Proc Qlim;

model Y=X / truncated(lb=0 ub=1);

model X=Z;

run;

may fit your data better.

As for estimating this model with endogeneity using a two-step method: a control function method (which is also a two-step procedure) works, BUT the procedure that you described earlier won't. Plugging in the first-step estimates of the endogenous variable and then estimating the nonlinear model in the second step produces an inconsistent estimator. Instead of estimating the endogenous variable itself, you should estimate the error term of the reduced form model X = Z. Here is what I mean:

X is endogenous if the error term of the structural model, say u, is correlated with that of the reduced form model, say v. We can model this as

u = theta v + e, where e is independent of v and theta captures the correlation between u and v (theta = 0 means X is exogenous).

Therefore, you can write the model of interest (the structural model) as

Y = beta X + theta v + e

Y is fractional so it's a nonlinear model.

Now, v is unobserved, so it should be replaced with its estimate: obtain it in the first step, plug it into the model above, and estimate that model appropriately. As far as I know, PROC LOGISTIC doesn't estimate models for fractional response variables, but I am not sure; you may want to check on this.
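The argument above can be written out compactly. This is only a sketch: the first-stage coefficient \pi and the residual \hat{v} are labels introduced here for illustration, intercepts are omitted, and the nonlinear link for the fractional Y is left implicit, just as in the equations above.

```latex
\begin{align}
X &= \pi Z + v && \text{(reduced form)} \\
u &= \theta v + e, \quad e \perp v && \text{(} \theta \neq 0 \text{ means } X \text{ is endogenous)} \\
Y &= \beta X + u = \beta X + \theta v + e && \text{(structural model)} \\
\hat{v} &= X - \hat{\pi} Z && \text{(first-step residual, used in place of } v \text{)}
\end{align}
```

Once \hat{v} is included as a regressor, the remaining error e is independent of v, which is why the second step includes both X and \hat{v} rather than replacing X with its predicted value.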

Dear Gunce@sas

Thanks for the detailed explanation.

To make sure I have understood correctly: is this the way you think the control function should be implemented?

(Basically, instead of using the estimated endogenous variable, I should save the residuals and then use them along with the original endogenous variable in the final model.)

Is this the case for every type of endogenous variable, or are you proposing it only for logistic regression because of its nonlinear form?

Proc reg;

model X=Z;

output out=temp pred=Xhat residual=vhat;

run;

proc logistic data=temp;

model Y=X vhat;

run;

By the way, in PROC LOGISTIC, if you have the number of trials and events, say N and S, then you can write:

proc logistic;

model S/N=X;

run;

Actually, I have the trials and events and use the method above, but for the sake of brevity, I just said let's suppose that we have a proportion as the dependent variable.

Hi Niam,

Your understanding is correct: you should obtain the residuals from each reduced form model (as many as the number of endogenous explanatory variables), which makes up your first step, and then insert them for the error term of the structural model and estimate it, which is the second step. A simple test on the coefficients of these residuals will give you a test of endogeneity.

You also asked a very good question that I should have explained before. The control function approach is used and valid when the model of interest (the structural model) is nonlinear and the endogenous explanatory variables are all continuous. Let me emphasize this one more time: when you estimate a nonlinear model with endogenous explanatory variables, the nature of the endogenous explanatory variables matters. For the control function method to produce a consistent estimator, the corresponding reduced form equations must be linear. In your example this is the case, so you can use either a joint likelihood method, like in the QLIM example I wrote earlier, or a control function method.
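Putting the pieces from this thread together, the two steps might look like the following. This is a minimal sketch rather than tested code from the thread: the data set names `mydata` and `stage1` are hypothetical, and S and N are the event and trial counts mentioned earlier.

```sas
/* Step 1: reduced form. Regress the endogenous X on the instrument Z
   and save the residuals as vhat. */
proc reg data=mydata noprint;
   model X = Z;
   output out=stage1 residual=vhat;
run;
quit;

/* Step 2: structural model in events/trials form, with the
   first-step residual vhat added as an extra regressor. */
proc logistic data=stage1;
   model S/N = X vhat;
run;
```

The Wald test for `vhat` in the PROC LOGISTIC parameter estimates table serves as the endogeneity test described above. One caveat: the second-step standard errors do not account for `vhat` having been estimated in the first step, so when endogeneity is present they may need adjusting, for example by bootstrapping both steps together.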

Perfect! Thank you very much!

Thanks @gunce@sas! That was an excellent explanation and I can now say my knowledge of 2SLS is much improved. Well presented.

Steve Denham
