Season
Lapis Lazuli | Level 10

Hello, everyone. I am currently building a multivariate linear regression. I found that significant collinearity exists among several independent variables and the intercept, with the largest condition index reaching a staggering 90. Yet with the COLLINOINT option in the MODEL statement of PROC REG, no collinearity was detected once the intercept was excluded.
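For concreteness, the diagnostics in question look like the following minimal sketch, where the data set mydata and the variables y and x1-x3 are hypothetical placeholders:

/* Collinearity diagnostics in PROC REG (hypothetical names).
   COLLIN includes the intercept in the diagnostics;
   COLLINOINT adjusts for the intercept before computing them. */
proc reg data=mydata;
   model y = x1 x2 x3 / collin collinoint vif;
run;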

I have come to know that principal component analysis (PCA) is one way of tackling collinearity. However, after reading about the method, I noticed a possible limitation: since (1) in the variable selection process that follows PCA, it is the principal components, not the original independent variables, that are selected, and (2) every principal component involves all of the independent variables to some degree, there is no way of "getting rid of" statistically insignificant variables during variable selection. I wonder whether this understanding is correct.

A second question follows: in the case of collinearity between several independent variables and the intercept (with no collinearity among the independent variables themselves), is variable standardization (i.e., transforming the independent variables so that they all have the same standard deviation, via procedures such as PROC STANDARD) still a feasible remedy, as it is in generalized linear models?
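By standardization I mean something like this minimal sketch (again with hypothetical names):

/* Center each predictor to mean 0 and scale it to standard deviation 1 */
proc standard data=mydata mean=0 std=1 out=mydata_std;
   var x1 x2 x3;
run;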

Thank you very much! 

PaigeMiller
Diamond | Level 26

Use Partial Least Squares (PROC PLS) for regression, not PCA, when there is collinearity. PLS is relatively robust against the effects of multicollinearity. Randy Tobias (of SAS Institute) gives an example of PLS creating a useful model from 1,000 highly correlated variables, without going through a variable selection step.

 

PCA is the wrong tool for regression (even though there are many examples in the literature). It assigns large loadings to variables based only on the X matrix, not on whether they are good predictors. PLS assigns large loadings to variables based on whether they are good predictors; both the X matrix and the Y matrix are used.
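A minimal sketch of such a PLS fit, assuming a hypothetical data set mydata with response y and correlated predictors x1-x10:

/* Partial least squares regression; CV=ONE requests leave-one-out
   cross-validation to help pick the number of PLS factors */
proc pls data=mydata cv=one;
   model y = x1-x10;
run;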

--
Paige Miller
Season
Lapis Lazuli | Level 10

Thank you for your advice! What about ridge regression and the LASSO? How do they compare with partial least squares (e.g., what are their advantages and disadvantages) when it comes to dealing with collinearity in multivariate linear regression?

PaigeMiller
Diamond | Level 26

I really can't answer questions about the Lasso.

 

The paper by Frank and Friedman (1993) showed that PLS had lower mean square error of the predictions and lower mean square error of the regression coefficients (sometimes by an order of magnitude) compared with variable selection methods, ridge regression, and principal components regression.
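For comparison, ridge regression is available through the RIDGE= option of PROC REG; a minimal sketch with hypothetical names:

/* Ridge estimates over a grid of ridge parameters k;
   the coefficients for each k are written to the OUTEST= data set */
proc reg data=mydata outest=ridge_est ridge=0 to 0.1 by 0.01;
   model y = x1 x2 x3;
run;
proc print data=ridge_est; run;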

 

Many people have been trained to believe that they MUST do variable selection (and trained to understand that it is time consuming and difficult to do), but that simply isn't true. There are probably thousands of published papers showing PLS creating useful models without variable selection; and the paper by Frank and Friedman shows that in most cases PLS produces a better model (as measured by the MSE of the predictions and the MSE of the regression coefficients).

 

 

--
Paige Miller
Season
Lapis Lazuli | Level 10

OK, thank you very much for your information, including the article you cited, the advantages of PLS compared with other methods, and your opinion regarding variable selection!

Rick_SAS
SAS Super FREQ (Accepted Solution)

To answer your question directly: principal component analysis does not eliminate any of the original variables. It reduces the dimension of the problem by keeping only a small number of linear combinations.

 

I think most people would refer to PCA as a dimension reduction method rather than a variable selection method. As you point out, a PCA model includes all of the original variables. The model keeps the linear combinations that explain most of the variance in the data. PCA regression uses some criterion to determine how many principal components to retain, and then includes only that small number of PCs in the model.
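As an illustration, a principal component regression might look like this minimal sketch (mydata, y, and x1-x5 are hypothetical names, and retaining two components is arbitrary):

/* Step 1: extract principal components of the predictors.
   The OUT= data set adds the score variables Prin1, Prin2, ... */
proc princomp data=mydata out=scores n=2;
   var x1-x5;
run;
/* Step 2: regress the response on the retained component scores */
proc reg data=scores;
   model y = prin1 prin2;
run;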

 

sbxkoenk
SAS Super FREQ

Hello,

 

Personally, I am also a supporter of PLS, as suggested by @PaigeMiller.

I am talking about "predictive Partial Least Squares regression" here (with one input block containing predictors and one output block containing response variables). 
See PROC PLS and PROC PLSMOD.

 

With regard to Principal Components Analysis (PCA) ...
Note that two interesting procedures were added in SAS Viya:

  • PROC RPCA (Robust PCA)
  • PROC KPCA (Kernel PCA)

Cheers,

Koen

Season
Lapis Lazuli | Level 10

Thank you, Koen, for kindly offering your help!

Your description of PLS is brief. Does "predictive Partial Least Squares regression" differ from other kinds of PLS?

You mentioned robust PCA and kernel PCA. I wonder how they differ from "ordinary" PCA (e.g., what goals can these methods achieve that ordinary PCA cannot), as well as how the two methods differ from each other.

Thank you very much!

Rick_SAS
SAS Super FREQ

Robust PCA uses robust estimates of the mean vector and covariance matrix, so the PCA is not unduly influenced by outliers in the data, which would otherwise bias the results.

 

Traditional PCA uses linear combinations of the original variables. Kernel PCA is a way to capture nonlinear combinations. It is mostly used for discriminant analysis and classification, which doesn't seem applicable to your situation.

Season
Lapis Lazuli | Level 10

😀Thank you for your explanation!👍

sbxkoenk
SAS Super FREQ

@Season wrote:

Does "predictive Partial Least Squares regression" differ from other kinds of PLS?

You mentioned robust PCA and kernel PCA. I wonder how they differ from "ordinary" PCA (e.g., what goals can these methods achieve that ordinary PCA cannot), as well as how the two methods differ from each other.


Note that the name "partial least squares" also applies to a more general statistical method that is not implemented in the PLS, HPPLS, and PLSMOD procedures. The partial least squares method was originally developed in the 1960s by the econometrician Herman Wold (1966) for modeling "paths" of causal relation between any number of "blocks" of variables. However, the (HP)PLS and PLSMOD procedures fit only predictive partial least squares models, with one "block" of predictors and one "block" of responses. If you are interested in fitting more general path models, you should consider using the CALIS procedure.

The (R)(K)PCA question was already answered by @Rick_SAS.

And although not applicable to your use case ... here are two blogs on k-PCA:

Koen

Rick_SAS
SAS Super FREQ

And regarding introductory articles about robust PCA: I wrote about RPCA back in 2010, well before Viya or PROC RPCA existed. See pp. 9-10 of Wicklin (2010) or the 2017 blog post, "Robust principal component analysis in SAS."

Season
Lapis Lazuli | Level 10

OK, thank you for your previous work and for kindly sharing it with me!

Season
Lapis Lazuli | Level 10

Thank you very, very much for your kind help, including the more detailed description of PLS and its history! I really appreciate your blending of humanities and statistics, with introductions to the history of statistical methods, including those of PLS and of the joint model for longitudinal and time-to-event data (which you had pointed out previously). For me, scientific history is not just a record of what happened in the past, but also a record of the trajectories along which the sciences developed, from which I can discern the pattern of existing scientific knowledge and disciplines, raise questions, or even propose theories. I firmly believe that questions are key drivers of scientific progress, as in the development of elliptic integrals when astronomers tried to calculate the circumferences of orbits. In a word (two words, to be exact😂), thank you!

Still, I am a novice at PLS, so I hardly know anything about the method apart from its name. Your interpretation of PLS in terms of "paths" and "blocks" illuminates the method somewhat, but I am still not entirely clear about it. I am going to read more about PLS. In the meantime, I would like to raise a brief question in search of a possible "shortcut": do you think that in the situation I encounter, only "predictive Partial Least Squares regression", and not other kinds of PLS, is suitable for reaching the goal I previously mentioned (tackling collinearity and conducting variable selection at the same time in a multivariate linear regression)? If so, maybe I do not need to learn every kind of PLS to reach my goal.

sbxkoenk
SAS Super FREQ

@Season wrote:

I would like to raise a brief question in search of a possible "shortcut": do you think that in the situation I encounter, only "predictive Partial Least Squares regression", and not other kinds of PLS, is suitable for reaching the goal I previously mentioned (tackling collinearity and conducting variable selection at the same time in a multivariate linear regression)? If so, maybe I do not need to learn every kind of PLS to reach my goal.

Forget about path modeling / path analysis in PROC CALIS.
What you need is the kind of PLS fit by the PLS / HPPLS / PLSMOD procedures (predictive PLS regression with an input block and an output block). If I am right, your output block has only one response variable, so you are doing multiple regression analysis and NOT multivariate regression!
The PLS method is well suited to tackling problems of multicollinearity.
You can choose a proper PLS model using cross-validation or test-set validation.
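For example, a cross-validated choice of the number of PLS factors might look like this minimal sketch (hypothetical data set and variable names):

/* CV=ONE requests leave-one-out cross-validation; CVTEST applies
   van der Voet's randomization test to choose the smallest model
   that is not significantly worse than the best cross-validated one */
proc pls data=mydata cv=one cvtest(seed=12345);
   model y = x1-x10;
run;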

If the curse of dimensionality is "hitting hard", you can consider running the VARREDUCE procedure before running PLS.

The VARREDUCE procedure performs both supervised and unsupervised variable selection. It selects variables by identifying a set of variables that jointly explain the maximum amount of the data variance.
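A minimal supervised sketch, assuming SAS Viya with a CAS table and hypothetical names throughout (mycas.mydata, y, x1-x20); MAXEFFECTS= caps the number of selected variables:

/* Supervised variable selection in SAS Viya (hypothetical names) */
proc varreduce data=mycas.mydata;
   reduce supervised y = x1-x20 / maxeffects=5;
run;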

Cheers,
Koen
