I am currently building a logistic regression model whose dependent variable follows a binomial distribution. Based upon my professional knowledge, I assume that collinearity exists among the independent variables. Therefore, I wish to perform principal component analysis to detect possible collinearities and to lower the dimension of the independent variables. How can I do this via SAS? Thanks!
PROC PRINCOMP will do this.
It will find reduced dimensions you can use, but CAUTION: some of those reduced dimensions may not be good predictors.
A better procedure, in my mind, is Logistic Partial Least Squares regression, which will find reduced dimensions that are good predictors (as good as the data will allow). While (non-logistic) Partial Least Squares regression is available in PROC PLS, Logistic Partial Least Squares is not available in SAS but is available as a package in R.
Thank you very, very much, Paige, for your kind help! I am not very familiar with principal component analysis or with PROC PRINCOMP, so I had assumed that PROC PRINCOMP only supports principal component analysis for models whose dependent variable is continuous.
One issue that bothers me much is the lack of information on how to perform principal component analysis for logistic regression via SAS. Since SAS Help has not provided an example on how to perform principal component analysis for logistic regression and I retrieved no results for my question after browsing SAS Community Library, could you please provide some hint on the detailed procedure of doing so? Or perhaps a tutorial written by someone else?
Many thanks!
@Season wrote:
Thank you very, very much, Paige, for your kind help! I am not very familiar with principal component analysis or with PROC PRINCOMP, so I had assumed that PROC PRINCOMP only supports principal component analysis for models whose dependent variable is continuous.
Principal components analysis does not use a Y-variable. Therefore, you can use it on the X-variables whether the Y-variables are continuous or categorical; it doesn't matter.
One issue that bothers me much is the lack of information on how to perform principal component analysis for logistic regression via SAS. Since SAS Help has not provided an example on how to perform principal component analysis for logistic regression and I retrieved no results for my question after browsing SAS Community Library, could you please provide some hint on the detailed procedure of doing so? Or perhaps a tutorial written by someone else?
It is no different than performing Principal Components for continuous Y. The Y-variable(s) are simply not used by PCA. As I stated above, (some of) the dimensions it finds may not be good predictors of Y.
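As a rough illustration of the procedure described above (a sketch, not code from the thread): PROC PRINCOMP is run on the X-variables only, and the component scores it writes to the OUT= data set (Prin1, Prin2, ...) are then used as predictors in PROC LOGISTIC. The data set WORK.SIM, the variables y and x1-x3, and the choice of two components are all made up for illustration.

```sas
/* Simulated data (hypothetical): x1 and x2 are deliberately correlated. */
data work.sim;
   call streaminit(2024);
   do id = 1 to 500;
      z   = rand('normal');
      x1  = z + 0.5*rand('normal');
      x2  = z + 0.5*rand('normal');
      x3  = rand('normal');
      eta = -0.5 + 0.8*x1 + 0.4*x3;
      y   = rand('bernoulli', 1/(1 + exp(-eta)));
      output;
   end;
   drop z eta;
run;

/* Step 1: PCA on the predictors only. y is not listed in VAR. */
proc princomp data=work.sim out=work.scores;
   var x1 x2 x3;
run;

/* Step 2: logistic regression on the leading component scores. */
/* Retaining two components is an arbitrary choice for this sketch. */
proc logistic data=work.scores;
   model y(event='1') = Prin1 Prin2;
run;
```

How many components to keep is a judgment call, usually guided by the eigenvalue table that PROC PRINCOMP prints. Because the components are chosen without looking at y, a component with a small eigenvalue can still turn out to be the useful predictor, which is the caution raised above.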
Ok, thank you very much for your help. Actually, the current model I have built works reasonably well. Still, some of the independent variables that professional knowledge has shown to be associated with the outcome tested as statistically insignificant in my analysis. Therefore, to improve my model, I am seeking help to determine whether the insignificance was caused by collinearity, by an insufficient sample size, or by other issues (e.g. outliers).
You have repeatedly reminded me that, in my situation, principal component analysis may not be the best choice. Thank you for the reminder. Actually, I have systematically studied statistics and the mathematics it is based on for only one year, so I can only use SAS right now. I will try the Logistic Partial Least Squares method if principal component analysis fails to solve this problem.
Thank you very much again!
Another thing to consider is a penalty-based model selection process such as LASSO, which is available in PROC HPGENSELECT and selects a subset of the candidate predictors rather than combining them all into a small number of functions. Also, note that if your concern is less about reducing the dimension of your predictors and more about collinearity causing ill-conditioning of the information matrix used in the model-fitting process, then that can be addressed as discussed in this note.
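A minimal sketch of what a LASSO selection for a binary response might look like in PROC HPGENSELECT. The data set and variable names are hypothetical, and the SELECTION statement is left at its defaults; the available sub-options (for example, how the final model is chosen) depend on the SAS/STAT release, so check the documentation for your installation.

```sas
/* Simulated data (hypothetical): x1 and x2 are deliberately correlated. */
data work.sim;
   call streaminit(2024);
   do id = 1 to 500;
      z   = rand('normal');
      x1  = z + 0.5*rand('normal');
      x2  = z + 0.5*rand('normal');
      x3  = rand('normal');
      eta = -0.5 + 0.8*x1 + 0.4*x3;
      y   = rand('bernoulli', 1/(1 + exp(-eta)));
      output;
   end;
   drop z eta;
run;

/* LASSO selection among the candidate predictors for a binary y. */
proc hpgenselect data=work.sim;
   model y(event='1') = x1 x2 x3 / distribution=binary;
   selection method=lasso;
run;
```

Unlike the principal components approach, the selected model keeps a subset of the original predictors, so the coefficients stay on the scale of x1-x3.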
Thank you for the help you have offered! Currently, my most important concern is the diagnostics of collinearity. I will take a look at LASSO and the note you have mentioned.
I have another question about the note you provided. Should a VIF computed with the weighted information matrix still be called "VIF", or should it be called "GVIF", as another SAS Community user mentioned?
Thank you!
Oh, by the way, I have another problem concerning the detection of collinearity in logistic regression. In linear regression models, tolerance, the variance inflation factor (VIF), and the condition index (computed from eigenvalues) can serve as indicators of collinearity among the independent variables in the model. These three statistics can be requested in PROC REG. However, they are not available in the procedures that fit logistic regression models (i.e. PROC LOGISTIC, PROC GENMOD, PROC HPLOGISTIC, etc.). Therefore, diagnosing collinearity in logistic regression is not that easy.
I tried PROC PRINCOMP on my data today and found that it does not compute these three statistics either. Instead, it produces a correlation matrix of the variables I wish to analyze. Not surprisingly, "strong" correlations exist among the variables I put in the logistic regression model, with some correlation coefficients as high as 0.6154. I suspect that collinearity exists in this situation.
So here are my questions: when it comes to diagnostics of collinearity, can correlation coefficients serve as surrogate statistics for tolerance, VIF and condition index in logistic regression? If not, what statistic(s) can do this job? Also, how can I compute tolerance, VIF and condition index in logistic regression?
Could @PaigeMiller, @StatDave or someone else kindly give me a hand?
Thank you all very much!
OK, thank you very much for your help! I will read the note you mentioned carefully and also try LASSO so that I can compare the two methods. It's too bad that SAS Community only allows one reply to be accepted as the solution. I think that your replies and those given by @PaigeMiller are very helpful not only for me, but also for everyone troubled by collinearity in logistic regression. After all, my Internet search turned up almost no articles discussing how to solve the collinearity problem. Instead of discussing mathematical or statistical theory at length before providing a solution (as most articles do), your replies get straight to the point and answer the problem directly. I regard your replies as wonderful "concise textbooks" on the problem. I am sure that they can benefit other researchers who are struggling to find a solution and spending more time searching for information than on the data analysis itself.
By the way, I major in medicine and am familiar with a few search engines that specialize in medical articles (e.g. PubMed). Could you please recommend a search engine that statisticians frequently use (aside from Google Scholar), or a few prestigious statistics journals?
Thank you both for your kind help again!
To use the VIF in PROC REG, you create a made-up continuous Y variable and regress it on your X-variables. The VIF does not depend on the Y variable.
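A minimal sketch of that trick, with hypothetical names throughout (WORK.SIM, x1-x3, and the made-up response fake_y). The VIF, TOL, and COLLIN options on the MODEL statement request the variance inflation factors, tolerances, and condition indices.

```sas
/* Simulated predictors (hypothetical) plus a made-up continuous response. */
data work.sim;
   call streaminit(2024);
   do id = 1 to 500;
      z      = rand('normal');
      x1     = z + 0.5*rand('normal');
      x2     = z + 0.5*rand('normal');
      x3     = rand('normal');
      fake_y = rand('normal');   /* arbitrary: not used by VIF or TOL */
      output;
   end;
   drop z;
run;

/* Only the collinearity diagnostics are of interest here. */
proc reg data=work.sim;
   model fake_y = x1 x2 x3 / vif tol collin;
run;
quit;
```

The parameter estimates and significance tests in this output are meaningless because fake_y is random noise; only the VIF, tolerance, and collinearity diagnostics tables should be read. COLLIN includes the intercept in the eigenanalysis; COLLINOINT is the variant that excludes it.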
OK, I see. Computing VIF in PROC REG when the dependent variable is a continuous one is easy. Yet the question I raised earlier is the computation of VIF in a logistic regression model. Can SAS do that? Thanks!
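One commonly used approach for a fitted logistic model, sketched here under stated assumptions rather than taken from the note mentioned in this thread: fit the model, compute the binomial variance weight w = p(1 - p) from the predicted probabilities, and rerun the predictors through PROC REG with a WEIGHT statement so that the diagnostics are based on the weighted cross-products. All data set and variable names are hypothetical.

```sas
/* Simulated data (hypothetical): x1 and x2 are deliberately correlated. */
data work.sim;
   call streaminit(2024);
   do id = 1 to 500;
      z   = rand('normal');
      x1  = z + 0.5*rand('normal');
      x2  = z + 0.5*rand('normal');
      x3  = rand('normal');
      eta = -0.5 + 0.8*x1 + 0.4*x3;
      y   = rand('bernoulli', 1/(1 + exp(-eta)));
      output;
   end;
   drop z eta;
run;

/* Step 1: fit the logistic model and save the predicted probabilities. */
proc logistic data=work.sim;
   model y(event='1') = x1 x2 x3;
   output out=work.pred p=phat;
run;

/* Step 2: weight for a binary response with the logit link. */
data work.pred;
   set work.pred;
   w = phat*(1 - phat);
run;

/* Step 3: weighted collinearity diagnostics; ignore the estimates. */
proc reg data=work.pred;
   weight w;
   model y = x1 x2 x3 / vif tol collin;
run;
quit;
```

As with the unweighted version, only the diagnostics tables are meaningful; the linear regression of the binary y is just a vehicle for computing them.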
Thank you for your kind and repeated reminders. In fact, I had just begun reading the note you mentioned when I was replying to @PaigeMiller yesterday. I now understand that the weights must be taken into account when diagnosing collinearity in generalized linear models.
Still, I have some questions:
(1) I noticed that the VAR statement of PROC STANDARD standardizes all of the independent variables in the logistic model (li, temp and cell). Given that collinearity exists only between the variable temp and the intercept, do all of the independent variables have to be standardized?
(2) The way of eliminating (or at least reducing) collinearity in a logistic regression model demonstrated here is variable standardization. In a complete model-building process, the PROC STANDARD step is followed by fitting the logistic regression model to the standardized variables. Eventually, the user may wish to express the model in terms of the unstandardized variables. When I was studying statistics, my teacher demonstrated an example of using SAS to perform principal component analysis for multivariate linear regression. She completed the final step (i.e. transforming the standardized variables back to the unstandardized ones after the entire model-building process) by writing down the equation by hand and performing the arithmetic herself.
Is there an automatic way of doing that final transformation in SAS?
(3) The situation illustrated in the note you provided was one where an independent variable is collinear with the intercept. What if the independent variables are collinear with each other? Aside from departing from the original model (i.e. switching to a penalty-based model selection process like LASSO or to other methods like Logistic Partial Least Squares regression) and simply deleting one or more of the variables involved in the collinearity, is variable standardization still a solution to that problem? If so, should the researcher standardize all of the independent variables, as in the note you provided, or just the ones involved in the collinearity?
Many thanks!
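On question (2), the back-transformation itself can be written down in general. This is a sketch under the assumption that each predictor was standardized as z_j = (x_j - m_j)/s_j (for example, by PROC STANDARD with MEAN=0 and STD=1), where m_j and s_j are the sample mean and standard deviation; the symbols are introduced here for illustration and do not come from the note.

```latex
% Model fitted on the standardized predictors z_j = (x_j - m_j)/s_j:
\operatorname{logit}(p) = b_0^{*} + \sum_{j} b_j^{*}\, z_j
                        = b_0^{*} + \sum_{j} b_j^{*}\, \frac{x_j - m_j}{s_j}

% Collecting terms gives the coefficients on the original scale:
b_j = \frac{b_j^{*}}{s_j},
\qquad
b_0 = b_0^{*} - \sum_{j} \frac{b_j^{*}\, m_j}{s_j}
```

Both forms give the same predicted probabilities, so the back-transformation only changes the parameterization, not the fit. The arithmetic is simple enough to carry out in a DATA step once the means and standard deviations have been saved.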