Hello, everyone. I am currently building a multivariate linear regression model. I found that significant collinearity exists among several independent variables and the intercept, with the largest condition index reaching a staggering 90. Yet when I used the COLLINOINT option in the MODEL statement of PROC REG, no collinearity was found once the intercept was excluded.
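For reference, this is roughly what I ran (a minimal sketch; the data set MYDATA and the variable names Y and X1-X4 are placeholders for my actual data):

proc reg data=mydata;
   /* COLLIN requests condition indices that include the intercept;  */
   /* COLLINOINT requests them with the intercept adjusted out.      */
   model y = x1-x4 / collin collinoint;
run;
quit;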
I have come to know that principal component analysis (PCA) is one way of tackling collinearity. However, after reading about the method, I noticed a possible limitation: since (1) in the variable selection process following principal component analysis, it is the principal components, not the original independent variables, that are selected, and (2) every principal component involves all of the independent variables to some degree, there is no way of "getting rid of" statistically insignificant variables during variable selection. I wonder if my notion is correct.
A second question follows: in the case of collinearity between several independent variables and the intercept (with no collinearity among the independent variables themselves), is variable standardization (i.e., transforming the independent variables so that they share the same standard deviation, via procedures like PROC STANDARD) still a feasible remedy, as it is in generalized linear models?
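By standardization I mean something along these lines (again a sketch with placeholder names):

proc standard data=mydata mean=0 std=1 out=mydata_std;
   var x1-x4;   /* center and scale each predictor to mean 0, SD 1 */
run;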
Thank you very much!
To answer your question directly: principal component analysis does not eliminate any of the original variables. It reduces the dimension of the problem by keeping only a small number of linear combinations.
I think most people would refer to PCA as a dimension reduction method rather than a variable selection method. As you point out, a PCA model includes all of the original variables. The method keeps the linear combinations that explain most of the variance in the predictors. Principal components regression uses some criterion to determine how many principal components to retain, then includes only that small number of PCs in the regression model.
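For example, a typical principal components regression workflow looks something like this (a minimal sketch; the data set, variable names, and the choice of three components are all hypothetical):

proc princomp data=mydata n=3 out=scores;
   var x1-x4;   /* retain 3 PCs; the scores Prin1-Prin3 are added to the OUT= data set */
run;

proc reg data=scores;
   model y = prin1-prin3;   /* regress on the component scores, not the original variables */
run;
quit;

Notice that each of Prin1-Prin3 is a linear combination of all four original variables, which is exactly why PCA does not "get rid of" any of them.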
Use partial least squares (PROC PLS) for regression, not PCA, when there is collinearity. PLS is relatively robust against the effects of multicollinearity. Randy Tobias (of SAS Institute) gives an example of PLS creating a useful model from 1,000 highly correlated variables, without having to go through a variable selection step.
PCA is the wrong tool for regression (even though there are many examples in the literature). It forms components with large loadings based only on the X matrix, without regard to whether the variables are good predictors of the response. PLS forms components with large loadings based on predictive ability: both the X matrix and the Y matrix are used.
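The difference is easy to see in PROC PLS, which can fit both methods (a sketch with hypothetical data set and variable names; NFAC=3 is arbitrary):

proc pls data=mydata method=pcr nfac=3;   /* components extracted from the X matrix alone */
   model y = x1-x4;
run;

proc pls data=mydata method=pls nfac=3;   /* components extracted using both X and Y */
   model y = x1-x4;
run;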
Thank you for your advice! What about ridge regression and the LASSO? What are their features (e.g., advantages and disadvantages) compared with partial least squares when it comes to dealing with collinearity in multivariate linear regression?
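For context, here is a sketch of the two alternatives I have in mind (placeholder data set and variable names; the ridge grid and LASSO options are only illustrative):

/* Ridge regression: trace the coefficients over a grid of ridge parameters */
proc reg data=mydata outest=ridge_est ridge=0 to 0.1 by 0.01;
   model y = x1-x4;
run;
quit;

/* LASSO with the penalty chosen by cross-validation */
proc glmselect data=mydata;
   model y = x1-x4 / selection=lasso(choose=cv stop=none);
run;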
I really can't answer questions about the Lasso.
The paper by Frank and Friedman showed that PLS had lower mean squared error of the predictions and lower mean squared error of the regression coefficients (sometimes by an order of magnitude) than variable selection methods, ridge regression, and principal components regression.
Many people have been trained that they MUST do variable selection (and trained to believe that it is time-consuming and difficult to do), and that simply isn't true. There are probably thousands of published papers showing PLS creating useful models without variable selection; and the paper by Frank and Friedman shows that in most cases PLS produces a better model (as measured by the MSE of the predictions and the MSE of the regression coefficients).
OK, thank you very much for your information, including the article you cited, the advantages of PLS compared with other methods, and your opinion regarding variable selection!
Hello,
Personally, I am also a supporter of PLS, as suggested by @PaigeMiller .
I am talking about "predictive Partial Least Squares regression" here (with one input block containing predictors and one output block containing response variables).
See PROC PLS and PROC PLSMOD.
With regard to Principal Components Analysis (PCA) ...
Note that in Viya two interesting procedures were added: PROC RPCA (robust principal component analysis) and PROC KPCA (kernel principal component analysis).
Cheers,
Koen
Thank you, Koen, for kindly offering your help!
Your description of PLS is brief. Does "predictive Partial Least Squares regression" differ from other kinds of PLS?
You mentioned robust PCA and kernel PCA. I wonder how they differ from "ordinary" PCA (e.g., what goals can these methods reach that ordinary PCA cannot), and how the two methods differ from each other.
Thank you very much!
Robust PCA uses robust estimates of the mean vector and covariance matrix, which means that the PCA is not unduly influenced by outliers in the data, which would otherwise bias the results.
Traditional PCA uses linear combinations of the original variables. Kernel PCA is a way to capture nonlinear combinations. It is mostly used for discriminant analysis and classification, which doesn't seem applicable to your situation.
😀Thank you for your explanation!👍
@Season wrote:
Does "predictive Partial Least Squares regression" differ from other kinds of PLS?
You mentioned robust PCA and kernel PCA. I wonder how they differ from "ordinary" PCA (e.g., what goals can these methods reach that ordinary PCA cannot), and how the two methods differ from each other.
Note that the name "partial least squares" also applies to a more general statistical method that is not implemented in the PLS, HPPLS, and PLSMOD procedures. The partial least squares method was originally developed in the 1960s by the econometrician Herman Wold (1966) for modeling "paths" of causal relations between any number of "blocks" of variables. However, the (HP)PLS and PLSMOD procedures fit only predictive partial least squares models, with one "block" of predictors and one "block" of responses. If you are interested in fitting more general path models, you should consider using the CALIS procedure.
The (R)(K)PCA question was already answered by @Rick_SAS .
And although not applicable to your use case ... here are two blogs on k-PCA:
Koen
And regarding introductory articles about robust PCA, I wrote about RPCA back in 2010, way before Viya or PROC RPCA were implemented. See p. 9-10 of Wicklin (2010) or the blog post from 2017, "Robust principal component analysis in SAS."
OK, thank you for your previous work and for kindly sharing it with me!
Thank you very, very much for your kind help, including the more detailed description of PLS and its history! I really appreciate your blending of humanities and statistics, with introductions to the history of statistical methods, including those of PLS and of joint models for longitudinal and time-to-event data (which you pointed out previously). For me, scientific history is not just a record of what happened in the past, but also a record of the trajectories along which the sciences developed, from which I can discern patterns in existing scientific knowledge and disciplines, raise questions, or even propose theories. I firmly believe that questions are key drivers of scientific progress, as was the case with the development of elliptic integrals when astronomers tried to calculate the circumferences of orbits. In a word (two words, to be exact 😂), thank you!
Still, I am a green hand at PLS, so I hardly know anything about the method apart from its name. Your interpretation of PLS in terms of "paths" and "blocks" somewhat illuminates it, but I am still not entirely clear about it. I am going to read more about PLS. In the meantime, I would like to raise a brief question for the sake of finding a possible "shortcut": do you think that in my situation only "predictive Partial Least Squares regression", and not other kinds of PLS, is suitable for reaching the goal I mentioned previously (tackling collinearity and conducting variable selection at the same time in a multivariate linear regression)? If so, maybe I do not need to learn every kind of PLS to reach my goal.
Forget about path modeling / path analysis in PROC CALIS.
What you need is the kind of PLS fit by the PLS / HPPLS / PLSMOD procedures (predictive PLS regression with an input block and an output block). If I am right, your output block has only one response variable, so you are doing multiple regression analysis and NOT multivariate regression!
The PLS method is well suited to tackling problems of multicollinearity.
You can choose a proper PLS model using cross-validation or test-set validation.
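A minimal sketch of cross-validated PLS, assuming one response Y and predictors X1-X10 (all names are placeholders):

proc pls data=train cv=one cvtest(seed=12345);   /* leave-one-out CV; van der Voet's test selects the number of factors */
   model y = x1-x10 / solution;                  /* SOLUTION prints the regression coefficients */
run;

With more observations you can replace CV=ONE with, for example, CV=SPLIT(10), or use CV=TESTSET(valid) for test-set validation against a holdout data set.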
If the curse of dimensionality is "hitting hard", you can consider running the VARREDUCE procedure before running PLS.
The VARREDUCE procedure performs both supervised and unsupervised variable selection. It selects variables by identifying a set of variables that can jointly explain the maximum amount of data variance.
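A rough sketch of supervised selection (PROC VARREDUCE is a Viya procedure, so this assumes an active CAS session and an in-memory table; the table name, variable names, and the MAXEFFECTS= cutoff are all illustrative):

proc varreduce data=mycas.train;
   reduce supervised y = x1-x100 / maxeffects=10;   /* keep the 10 jointly most explanatory predictors */
run;

There is also a REDUCE UNSUPERVISED statement that selects variables from the predictors alone; see the PROC VARREDUCE documentation for the full set of options.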
Cheers,
Koen