How can I perform principal component analysis for logistic regression...

Posted 02-24-2023 09:22 PM
(4598 views)

I am currently building a logistic regression model whose dependent variable follows a binomial distribution. Based upon my professional knowledge, I assume that collinearity exists among the independent variables. Therefore, I wish to perform principal component analysis to detect possible collinearities and to lower the dimension of the independent variables. How can I do this via SAS? Thanks!

1 ACCEPTED SOLUTION

@Season wrote:

> Thank you very, very much, Paige, for your kind help! Actually, I am not very familiar with principal component analysis or PROC PRINCOMP. Therefore, I previously thought that PROC PRINCOMP only supports principal component analysis for models whose independent variable is a continuous one.

Principal components analysis does not use a Y-variable. Therefore, you can apply it to the X-variables whether the Y-variable is continuous or categorical.

> One issue that bothers me is the lack of information on how to perform principal component analysis for logistic regression via SAS. Since SAS Help does not provide an example of this and I found no results for my question in the SAS Community Library, could you please provide some hints on the detailed procedure? Or perhaps a tutorial written by someone else?

It is no different than performing Principal Components for continuous Y. The Y-variable(s) are simply not used by PCA. As I stated above, (some of) the dimensions it finds may not be good predictors of Y.
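To make the point concrete, here is a minimal sketch in Python (not SAS) of a principal component analysis of just two predictors. It relies on the fact that a 2×2 correlation matrix `[[1, r], [r, 1]]` has eigenvalues `1 + |r|` and `1 - |r|`. The function names and the sample data are hypothetical. Note that no Y-variable appears anywhere in the computation; the resulting scores would then be used as predictors in whatever model is fitted afterwards. In SAS, PROC PRINCOMP with the OUT= option writes comparable score variables (named Prin1, Prin2, ... by default) to a data set.

```python
import math

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def principal_components_2(x1, x2):
    """PCA of two predictors based on their correlation matrix.

    Returns (eigenvalues, prin1_scores, prin2_scores).
    Only the predictors are used; there is no response argument.
    """
    z1, z2 = standardize(x1), standardize(x2)
    n = len(z1)
    r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)
    sign = 1.0 if r >= 0 else -1.0
    # Eigenvectors are (1, sign)/sqrt(2) and (1, -sign)/sqrt(2).
    eigenvalues = (1.0 + abs(r), 1.0 - abs(r))
    root2 = math.sqrt(2.0)
    prin1 = [(a + sign * b) / root2 for a, b in zip(z1, z2)]
    prin2 = [(a - sign * b) / root2 for a, b in zip(z1, z2)]
    return eigenvalues, prin1, prin2

# Hypothetical predictor values, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
eigenvalues, prin1, prin2 = principal_components_2(x1, x2)
print(eigenvalues)
```

The two score columns are uncorrelated with each other by construction, which is why they sidestep collinearity, but nothing guarantees that either one is related to Y.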

--

Paige Miller


33 REPLIES

PROC PRINCOMP will do this.

It will find reduced dimensions you can use, but CAUTION: some of those reduced dimensions may not be good predictors.

A better procedure, in my mind, is Logistic Partial Least Squares regression, which will find reduced dimensions that are good predictors (as good as the data will allow). While (non-logistic) Partial Least Squares regression is available in PROC PLS, Logistic Partial Least Squares is not available in SAS but is available as a package in R.

--

Paige Miller


Thank you very, very much, Paige, for your kind help! Actually, I am not very familiar with principal component analysis or PROC PRINCOMP. Therefore, I previously thought that PROC PRINCOMP only supports principal component analysis for models whose independent variable is a continuous one.

One issue that bothers me is the lack of information on how to perform principal component analysis for logistic regression via SAS. Since SAS Help does not provide an example of this and I found no results for my question in the SAS Community Library, could you please provide some hints on the detailed procedure? Or perhaps a tutorial written by someone else?

Many thanks!

OK, thank you very much for your help. Actually, the model I have built so far works reasonably well. Still, some of the parameters that professional knowledge indicates should be associated with the outcome turned out not to be statistically significant in my analysis. Therefore, to improve my model, I came to seek help in examining whether the lack of significance was caused by collinearity, by an insufficient sample size, or by other issues (e.g. outliers).

You have repeatedly reminded me that in my situation, principal component analysis may not be the best choice. Thank you for the reminder. Actually, I have systematically studied statistics and the mathematics it is based on for only a year, so I can only use SAS right now. I will try the Logistic Partial Least Squares method if principal component analysis fails to solve this problem.

Thank you very much again!

I have another question on the note you provided. Should a VIF computed with the weighted information matrix still be called "VIF", or should it be called "GVIF", as another SAS Community user mentioned?

Thank you!

Oh, by the way, I have another problem concerning the diagnosis (detection) of collinearity in logistic regression. In linear regression models, tolerance, the variance inflation factor (VIF), and the condition index (computed from eigenvalues) can serve as indicators of collinearity among the independent variables in the model. These three statistics can be computed in PROC REG upon request. However, they are not available in the procedures that build logistic regression models (i.e. PROC LOGISTIC, PROC GENMOD, PROC HPLOGISTIC, etc.). Therefore, diagnosing collinearity in logistic regression is not that easy.

I tried PROC PRINCOMP on my data today and found that it does not compute the three statistics either. Instead, it produces a correlation matrix of the variables I wish to analyze. Unsurprisingly, "strong" correlations exist among the variables I put in the logistic regression model, with some correlation coefficients as high as 0.6154. I guess that collinearity must exist in this situation.

So here are my questions: when it comes to diagnostics of collinearity, can correlation coefficients serve as surrogate statistics for tolerance, VIF and condition index in logistic regression? If not, what statistic(s) can do this job? Also, how can I compute tolerance, VIF and condition index in logistic regression?

Could @PaigeMiller, @StatDave or someone else kindly give me a hand?

Thank you all very much!

The note I referred you to in my last post specifically discusses and shows how to get collinearity diagnostics for a logistic (or other generalized linear) model. I suggest you read the collinearity section of that note and use the method shown. As noted there, correlation among your predictors is not, by itself, necessarily a problem. But as I also mentioned, you might not even need to bother with the diagnostics if you use the penalty-based LASSO selection method to pick out the important predictors.
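As a rough illustration of why a pairwise correlation alone says little, recall the standard linear-regression definitions of tolerance and VIF for predictor $j$, where $R_j^2$ is the R-square from regressing that predictor on all the other predictors. These are the unweighted definitions; for a logistic model the thread's advice is to use a weighted version, so the numbers below are only indicative.

$$
\mathrm{Tolerance}_j = 1 - R_j^2, \qquad \mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

In the special case of exactly two predictors, $R_j^2$ is just the squared pairwise correlation, so the largest correlation reported above would give

$$
\mathrm{VIF} = \frac{1}{1 - 0.6154^2} = \frac{1}{1 - 0.3787} \approx 1.61
$$

With more than two predictors, $R_j^2$ depends on all of them jointly and can be much larger (or no larger) than any single squared pairwise correlation, which is why the correlation matrix is not a reliable substitute for these diagnostics.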

OK, thank you very much for your help! I will read the note you mentioned carefully and also try LASSO so that I can compare the two methods. It's too bad that SAS Community allows only one reply to be accepted as the solution. I think that your replies and those given by @PaigeMiller are all very fruitful, not only for me but also for everyone troubled by collinearity in logistic regression. After all, I found almost no articles discussing solutions to the collinearity problem when I searched the Internet. Instead of discussing mathematical or statistical theory at length before providing a solution (as most articles do), your replies get straight to the point and answer the problem directly. I regard your replies as wonderful "concise textbooks" on the problem. I am sure they can benefit other researchers who are struggling to find a solution and are spending more time searching for information than on the data analysis itself.

By the way, I major in medicine and am familiar with a few search engines that specialize in medical articles (e.g. PubMed). Could you please recommend a search engine that statisticians frequently use (aside from Google Scholar), or a few prestigious statistics journals?

Thank you both for your kind help again!

To get the VIF from PROC REG, you create a made-up continuous Y-variable and use it with your X-variables. The VIF does not depend on the Y-variable.
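A small sketch in Python (not SAS) shows why the made-up Y is harmless: the ordinary, unweighted VIFs are the diagonal elements of the inverse of the predictors' correlation matrix, so the response never enters the calculation. The function names and data are hypothetical, and the matrix inversion is a plain Gauss-Jordan routine meant only for a handful of predictors.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of equal-length predictor columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """Unweighted VIF for each predictor; takes predictors only, no Y."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Hypothetical predictor columns, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x3 = [3.0, 1.0, 2.0, 6.0, 5.0, 4.0]
print(vif([x1, x2, x3]))
```

Because `vif` accepts only the predictor columns, any Y (real or made up) would produce the same values.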

--

Paige Miller


If you read through the note I referred to, then you should have learned that collinearity diagnostics (like VIF) for a logistic model (or any generalized linear model) require using appropriate weights in PROC REG. As is specifically illustrated for a logistic model in that note, the weights can be obtained by first fitting the model in PROC GENMOD and saving the HESSWGT= values. When you then fit the model (to any response values) in PROC REG using those weights, you get the appropriate collinearity diagnostics for assessing your logistic model.
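The idea of weighting can be sketched in Python (not SAS) for the simplest case of two predictors. This sketch makes two assumptions that are not stated in the thread: that for a binary logistic model the Hessian weight of observation i is p_i(1 - p_i), where p_i is the fitted probability, and that the weighted VIF is 1/(1 - R²) with R² taken from a weighted regression of one predictor on the other. All names and data are hypothetical.

```python
import math

def hessian_weights(fitted_probs):
    """Assumed logistic Hessian weights: w_i = p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in fitted_probs]

def weighted_corr(x1, x2, w):
    """Weighted Pearson correlation between two predictors."""
    total = sum(w)
    m1 = sum(wi * a for wi, a in zip(w, x1)) / total
    m2 = sum(wi * b for wi, b in zip(w, x2)) / total
    s11 = sum(wi * (a - m1) ** 2 for wi, a in zip(w, x1))
    s22 = sum(wi * (b - m2) ** 2 for wi, b in zip(w, x2))
    s12 = sum(wi * (a - m1) * (b - m2) for wi, a, b in zip(w, x1, x2))
    return s12 / math.sqrt(s11 * s22)

def weighted_vif_pair(x1, x2, fitted_probs):
    """Weighted VIF shared by both predictors in a two-predictor model."""
    r = weighted_corr(x1, x2, hessian_weights(fitted_probs))
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors and fitted probabilities, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
probs = [0.05, 0.10, 0.30, 0.60, 0.85, 0.95]
print(weighted_vif_pair(x1, x2, probs))
```

Observations with fitted probabilities near 0 or 1 get small weights, so the weighted diagnostics can differ from the unweighted ones even though the predictors are the same. When all fitted probabilities are equal, the weights are constant and the result reduces to the ordinary VIF.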

Thank you for your kind and repeated reminders. In fact, I had just begun reading the note you mentioned when I was replying to @PaigeMiller yesterday. I now understand that weights must be applied when diagnosing collinearity in generalized linear models.

Still, I have some questions:

(1) I noticed that the VAR statement of PROC STANDARD standardizes all of the independent variables in the logistic model (li, temp and cell). Given that collinearity exists only between the variable temp and the intercept, do all of the independent variables have to be standardized?

(2) The means of eliminating (or at least reducing) collinearity in a logistic regression model demonstrated here is variable standardization. In a complete model-building process, PROC STANDARD is followed by fitting the logistic regression with these standardized variables. Eventually, the user may wish to transform the standardized variables back into unstandardized ones. When I was a student studying statistics, my teacher demonstrated an example of using SAS to perform principal component analysis for multivariate linear regression. She completed the final step (i.e. transforming the standardized variables back to the unstandardized ones after the entire model-building process) by writing down the equation by hand and performing the arithmetic herself.

Is there an automatic way of doing that final transformation in SAS?

(3) The circumstance illustrated in the note you provided was one in which one independent variable is collinear with the intercept. What if the independent variables are collinear with each other? Aside from deviating from the original model (i.e. switching to a penalty-based model selection process like LASSO or to other methods like Logistic Partial Least Squares regression) and simply deleting one or more of the variables involved in the collinearity, is variable standardization still a solution to that problem? If so, should the researcher standardize all the independent variables, as in the note you provided, or just those that are involved in the collinearity?

Many thanks!
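On question (2), the back-transformation itself is simple arithmetic, sketched here in Python (not SAS). It assumes each predictor was standardized as z = (x - mean) / sd, which is an assumption about how the standardization was done, not something stated in the thread. Under that assumption, a linear predictor a + b·z equals (a - b·mean/sd) + (b/sd)·x. The function name and the numbers are hypothetical.

```python
def to_original_scale(intercept_std, slopes_std, means, sds):
    """Convert coefficients fitted on standardized predictors
    z = (x - mean) / sd back to the original predictor scale."""
    slopes = [b / s for b, s in zip(slopes_std, sds)]
    intercept = intercept_std - sum(
        b * m / s for b, m, s in zip(slopes_std, means, sds)
    )
    return intercept, slopes

# Hypothetical coefficients, means and standard deviations.
intercept, slopes = to_original_scale(
    intercept_std=0.5,
    slopes_std=[1.2, -0.8],
    means=[10.0, 3.0],
    sds=[2.0, 0.5],
)
print(intercept, slopes)
```

The fitted probabilities are identical under both parameterizations; only the units in which the coefficients are expressed change.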
