Hello SAS experts,
My question is regarding multi-collinearity in logistic regression. I have two categorical and two continuous variables. I ran the original model using PROC LOGISTIC.
I wanted to run the full model (4 variables) including interactions, but the model becomes "saturated". I decided to run two separate analyses: 1) one for the two categorical variables + interactions and 2) the other for the two continuous variables + interactions.
The problem with both analyses is the presence of multi-collinearity.
I think this may be heresy, but in order to show you how bad the multi-collinearity is, I ran the analysis for the categorical variables in Minitab 19 using the "Binary Logistic Regression" tool, because it provides a compact table with the VIF for each categorical variable and their interactions, and it automatically shows the diagnostics for the model.
I am showing the results of the full analysis in the figures below.
The no-interaction model for the categorical variables still shows multi-collinearity (VIF above 1 for all variables).
Given that multi-collinearity is caused by the explanatory variables being correlated, I thought the simplest solution for my data would be to run one logistic regression for each of the variables that I need to evaluate.
Is that an approach any of you could agree with? If not, is there any better solution you could suggest?
Thank you in advance. I apologize for the Minitab Output.
Regards,
Marcel
"I thought the simplest solution for my data would be to run one logistic regression for each of the variables that I need to evaluate."
And then what? Now you have some logistic regressions, how do you continue the analysis to say what happens (or how the model predicts) using more than one (or maybe even all) the variables?
Please consider using Partial Least Squares Regression (PROC PLS), which is robust against multicollinearity. Oh ... wait ... that only works for continuous Y variables, not binary Y variables. You could use the Logistic Partial Least Squares method (https://cedric.cnam.fr/fichiers/RC906.pdf) which is robust against multicollinearity and works well in my experience, but no SAS code is available, although I think there is an R package which does this.
Thank you Paige Miller. First, I will try to figure out if the large VIFs are due to the quasi-complete separation of points I found in my data.
If there is collinearity among the predictors, it is important to determine what defines the collinearity. See this note which produces collinearity statistics for a logistic regression and examines the eigenvectors to determine the nature of the collinearity in the model.
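One common way to get VIF-type diagnostics for a logistic model is to rerun the fit as a weighted linear regression, using the weights p*(1-p) from the fitted logistic model. The sketch below is only an illustration of that idea and is not necessarily the code in the note. It assumes the SASHELP.HEART data and the continuous predictors used elsewhere in this thread; the data set name pred and the variables phat, w, and dead are made up for the example.

```sas
/* Step 1: fit the logistic model and save the predicted probabilities */
proc logistic data=sashelp.heart noprint;
   model status(event='Dead') = ageatstart height weight diastolic;
   output out=pred p=phat;          /* phat is a hypothetical name */
run;

/* Step 2: build the weight p*(1-p) and a numeric 0/1 response,  */
/* because PROC REG needs a numeric dependent variable           */
data pred;
   set pred;
   w    = phat * (1 - phat);
   dead = (status = 'Dead');
run;

/* Step 3: weighted regression; only the collinearity output     */
/* (VIF, tolerance, condition indices) is of interest here       */
proc reg data=pred;
   weight w;
   model dead = ageatstart height weight diastolic / vif tol collinoint;
run;
quit;
```

The parameter estimates from PROC REG should be ignored; the point is the VIF, tolerance, and condition index tables, which are computed from the weighted predictors.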
Your comment is interesting, because I have quasi-complete separation of points in my data. For that reason I used the Firth correction. So now I have to find out whether the large VIFs are due to the quasi-complete separation or to multi-collinearity.
I am having issues with the coding for my categorical variables to use them with proc genmod and proc reg.
My original table has this formatting:
In order to use it with proc genmod and proc reg, I am planning to code it like this:
That may not be the right way to do it?
Regards,
Marcel
@marcel wrote:
Your comment is interesting, because I have quasi-complete separation of points in my data. For that reason I used the Firth correction. So now I have to find out whether the large VIFs are due to the quasi-complete separation or to multi-collinearity.
I am having issues with the coding for my categorical variables to use them with proc genmod and proc reg.
My original table has this formatting:
In order to use it with proc genmod and proc reg, I am planning to code it like this:
Recoding variable values from a b c ... to 1 2 3 ... makes not the slightest bit of difference if they are still CLASS variables. If you are thinking of turning the variables into continuous variables this way, I think that's a mistake unless they REALLY are continuous or ordinal.
You say you are having "issues", but you don't specify what the "issues" are, or how this re-coding changes anything.
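In case the issue is that PROC REG has no CLASS statement (PROC GENMOD does, so no recoding is needed there), here is a rough sketch of one way to let SAS build the dummy variables instead of recoding by hand. It assumes SASHELP.HEART with SEX and BP_STATUS standing in for your two categorical variables; the data set names heart and design and the variable dead are made up for the example. PROC GLMSELECT is used here only to write out the design matrix, not to select a model.

```sas
/* Numeric 0/1 response, since GLMSELECT and REG need a numeric Y */
data heart;
   set sashelp.heart;
   dead = (status = 'Dead');
run;

/* Write the design matrix (reference-coded dummies for the main  */
/* effects and the interaction) to the data set DESIGN            */
proc glmselect data=heart noprint
               outdesign(addinputvars)=design;
   class sex bp_status / param=ref;
   model dead = sex bp_status sex*bp_status / selection=none;
run;

/* The macro variable _GLSMOD holds the names of the generated    */
/* design columns, so they can be passed straight to PROC REG     */
%put &_GLSMOD;

proc reg data=design;
   model dead = &_GLSMOD / vif tol;
run;
quit;
```

The unweighted VIFs from this run are only a rough screen for a logistic model, but the same design columns could be used in whatever collinearity check you settle on.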
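VIF is for PROC REG. In PROC LOGISTIC, check the CORRB option, which prints the correlation matrix of the parameter estimates. @Rick_SAS also wrote a blog post about it (the covariance of the estimators).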
proc logistic data=sashelp.heart;
   class sex;
   model status = sex agechddiag ageatstart height weight diastolic / corrb;
run;