Solved: Re: Logistic Regression with Instrumental Variable

niam · Posted 06-16-2014 11:51 PM

Hello;

I am trying to regress a ratio variable, Y, on an independent variable, X.

Variable X is endogenous, so I have to use an instrumental variable, Z.

Both X and Z are continuous variables.

How can I run this model in SAS? I know SYSLIN easily does it if the dependent variable is continuous, but what if the dependent variable is a ratio or dichotomous?

PROC QLIM only allows the left-hand-side variable to be a binary outcome and endogenous. Again, when the right-hand-side variable is endogenous, QLIM is not useful.

If there is no built in procedure, is there any alternative way to run this model in SAS?

Thanks for your help in advance.

SteveDenham · Posted 06-17-2014 07:36 AM

Can you be a bit more explicit in describing the ratio variable Y? Is it bounded (above, below, or both)? Is it a proportion or pseudo-proportion (bounded below by 0 and above by 1)?

I think the best you might do with PROC QLIM is to just treat Y as a continuous variable, with censoring/cutoff (I could be wrong about this). If that is unsatisfactory, you may wish to consider a generalized linear (mixed) modeling procedure, such as GENMOD or GLIMMIX, with an appropriate distributional assumption.

Steve Denham

niam · Posted 06-17-2014 05:46 PM

Dear Steven;

Y represents the number of successes divided by the total number of trials in a sample. I tried QLIM with a bounded dependent variable (0<Y<1), but I am not sure why the estimation results differ from the PROC LOGISTIC results (before adjusting for the endogenous variable, that is, modeling Y as a function of X only and ignoring endogeneity).

ets_kps · Posted 06-17-2014 05:14 PM

Hi Niam,

Actually this was true prior to the latest release of ETS. As of the 13.1 release QLIM supports a number of models with RHS endogenous regressors, including logit and probit models.

Please see this documentation and let me know if you would like any help using it.

SAS/ETS(R) 13.1 User's Guide

Ken

niam · Posted 06-17-2014 06:10 PM

Dear ets_kps;

Thank you very much for your helpful answer.

I tried to use the QLIM procedure. Since Y is a proportion variable (the ratio of the number of successes to the total number of trials), I first tried the following code:

Proc Qlim;

model Y=X /discrete(d=logit);

model X=Z;

run;

but I get the following error message:

"There are only 51 non missing observation in the input data set. It is too small to estimate parameters."

Then I tried to define Y as a censored dependent variable and used the following code:

Proc Qlim;

model Y=X/censored(lb=0 ub=1);

model X=Z;

run;

This model runs and gives me some results; however, they are not consistent with an alternative method that I implemented. As I told Steve, even the simple regression (ignoring the second model for the endogenous variable) produces results that differ from those of PROC LOGISTIC. This is the alternative method:

Proc reg;

model X=Z;

output out=temp pred=Xhat;

run;

proc logistic data=temp;

model Y=Xhat;

run;

This is basically a variation of the 2SLS method. I get the predicted values of the endogenous variable from the first-stage OLS regression and then use them in the second-stage logistic regression.

Why do you think the two methods produce different results? Am I using PROC QLIM correctly? Is the censored option the best way to handle proportion variables in QLIM? Or is there a problem with the alternative method?

Thank you very much for your help;

Best

niam · Posted 06-17-2014 06:40 PM

Dear Steve and ets_kps;

If I use a binary dependent variable instead of a proportion variable, then PROC QLIM (with the /discrete(d=logit) option) and PROC LOGISTIC produce identical results. So the problem is how to model proportion variables with QLIM, since the /censored(lb=0 ub=1) option in PROC QLIM does not produce results similar to those of PROC LOGISTIC.

SteveDenham · Posted 06-18-2014 11:09 AM

While you can do the manual two-stage regression, doesn't it run into the problem that the amount of data (or complete cases, not sure which) that PROC QLIM complains about will result in unstable estimates with unacceptably large standard errors?

I don't think the censored option is a good way to proceed at this point. That leaves the manual 2SLS, or maybe a Bayesian approach in QLIM.

Steve Denham

niam · Posted 06-18-2014 02:12 PM

No, with PROC LOGISTIC everything runs beautifully; however, I am a little concerned about the validity of its results.

This is why I am concerned:

gunce_sas · Posted 06-18-2014 02:54 PM

I think that when the dependent variable is a fractional response variable, it should be modeled as truncated rather than censored. With censoring at lower bound 0 and upper bound 1, you are saying that observations that are negative or bigger than 1 actually exist but you are not able to observe them in your sample. With truncation at lower bound 0 and upper bound 1, you are saying that the support of the distribution is [0, 1] and observations can't exist beyond these boundaries. Hence,

Proc Qlim;

model Y=X / truncated(lb=0 ub=1);

model X=Z;

run;

may fit your data better.

When it comes to estimating this model with endogeneity using a two-step method, a control function method (which is also a two-step procedure) works, BUT the procedure that you described earlier won't. Plugging in the first-step predictions of the endogenous variables and then estimating the nonlinear model in the second step produces an inconsistent estimator. Instead of estimating the endogenous variables, you should estimate the error term of the reduced form model X = Z. Here is what I mean:

X is endogenous if the error term of the structural model, say u, is correlated with that of the reduced form model, say v. We can model this as

u = theta v + e, where e is independent of v and theta is the coefficient that captures the correlation between u and v.

Therefore, you can write the model of interest (the structural model) as

Y = beta X + theta v + e

Y is fractional so it's a nonlinear model.

Now, v is unobserved, so it should be replaced with its estimate. You obtain this estimate in the first step, plug it into the model above, and then estimate that model appropriately. As far as I know, PROC LOGISTIC doesn't estimate fractional response variables, but I am not sure, so you may want to check on this.
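Written out for a single endogenous regressor, the argument above can be sketched as follows. This is an illustrative formalization rather than the notation used in the post: the intercepts \pi_0 and \beta_0, the slope \pi_1, and the logistic link \Lambda are assumptions added here, and the logit form of the conditional mean is the assumption that a second-step logistic fit relies on.

```latex
\begin{aligned}
X &= \pi_0 + \pi_1 Z + v
  && \text{(reduced form, estimated by OLS)} \\
u &= \theta v + e, \qquad e \text{ independent of } v
  && \text{(link between the two error terms)} \\
\mathrm{E}[\,Y \mid X, v\,] &= \Lambda(\beta_0 + \beta X + \theta v),
  \qquad \Lambda(t) = \frac{1}{1 + \exp(-t)}
  && \text{(structural model given } v\text{)} \\
\hat{v} &= X - \hat{\pi}_0 - \hat{\pi}_1 Z
  && \text{(first-step residual that replaces } v\text{)}
\end{aligned}
```

With \theta = 0 the control term drops out, which corresponds to X being exogenous.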

niam · Posted 06-18-2014 03:58 PM

Dear Gunce@sas

Thanks for the detailed explanation.

To make sure I have understood correctly, is this the way you think the control function should be implemented?

(Basically, instead of using the predicted endogenous variable, I should save the residuals and then use them along with the original endogenous variable in the final model.)

Is this the case for every other type of endogenous variable, or are you proposing this only for logistic regression because of its nonlinear form?

Proc reg;

model X=Z;

output out=temp pred=Xhat residual=vhat;

run;

proc logistic data=temp;

model Y=X vhat;

run;

By the way, in PROC LOGISTIC, if you have the number of trials and events, say N and S, then you can write:

proc logistic;

model S/N=X;

run;

Actually, I have the trials and events and use the method above, but for the sake of brevity I just said let's suppose that we have a proportion as the dependent variable.
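Putting the two pieces together, the control function steps can be combined with the events/trials syntax. The following is only a minimal sketch on simulated data: the data set names SIM and STAGE1, the seed, the sample size, and all coefficient values are illustrative assumptions, not anything taken from the real data discussed here.

```sas
/* Simulated example data; every name and number below is an assumption */
data sim;
   call streaminit(2014);
   do id = 1 to 500;
      Z   = rand('normal');             /* instrument                          */
      v   = rand('normal');             /* reduced-form error                  */
      X   = 0.8*Z + v;                  /* endogenous regressor                */
      eta = -0.5 + 1.0*X + 0.7*v;       /* v enters the outcome equation too,  */
                                        /* which is what makes X endogenous    */
      p   = 1 / (1 + exp(-eta));        /* success probability (logit link)    */
      N   = 20;                         /* number of trials                    */
      S   = rand('binomial', p, N);     /* number of events                    */
      output;
   end;
   drop v eta p;                        /* v is unobserved in practice         */
run;

/* Step 1: reduced form by OLS, saving the residual as the control function */
proc reg data=sim noprint;
   model X = Z;
   output out=stage1 residual=vhat;
run;
quit;

/* Step 2: events/trials logistic model with X and the first-step residual */
proc logistic data=stage1;
   model S/N = X vhat;
run;
```

In this simulation the coefficient on X in the outcome equation is 1.0, so the step 2 estimate for X can be compared with a fit that omits vhat to see the effect of ignoring endogeneity.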

gunce_sas · Posted 06-19-2014 10:12 AM

Hi Niam,

Your understanding is correct: you should obtain the residuals for each reduced form model (as many as the number of endogenous explanatory variables) -- this makes up your first step -- and then insert them for the error term of the structural model and estimate -- this is the second step. A simple test on the coefficients of these residuals will give you a test of endogeneity.

You also asked a very good question that I should have explained before. The control function approach is used and valid when the model of interest (the structural model) is nonlinear and the endogenous explanatory variables are all continuous. Let me emphasize this one more time: when you estimate a nonlinear model with endogenous explanatory variables, the nature of the endogenous explanatory variables matters. For the control function method to produce a consistent estimator, the corresponding reduced form equations must be linear. In your example this is the case, so you can use either a joint likelihood method, like in the QLIM example I wrote earlier, or a control function method.
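As a rough illustration of the endogeneity test mentioned above, the second step can carry a TEST statement on the residual's coefficient. This sketch assumes the data set TEMP produced by the PROC REG step earlier in the thread (so it already contains vhat) and assumes it also holds the event and trial counts S and N; the label endog is an arbitrary name chosen here.

```sas
/* Assumes TEMP holds S (events), N (trials), X, and the first-step residual vhat */
proc logistic data=temp;
   model S/N = X vhat;
   /* Wald test of H0: the coefficient on vhat is zero, i.e. no evidence of endogeneity */
   endog: test vhat = 0;
run;
```

With a single residual this duplicates the Wald chi-square already reported for vhat in the parameter estimates table; the TEST statement becomes more useful with several endogenous regressors, where the residual coefficients can be tested jointly by listing one equation per residual, separated by commas. Note that the second-step standard errors treat vhat as if it were observed data, so for inference on the other coefficients it may be worth bootstrapping both steps together.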

niam · Posted 06-19-2014 12:12 PM

Perfect! Thank you very much!

SteveDenham · Posted 06-19-2014 09:50 AM

Thanks @gunce@sas! That was an excellent explanation and I can now say my knowledge of 2SLS is much improved. Well presented.

Steve Denham
