Re: Using PCA for modeling and turning back the coefficients

chemicalab · Posted 08-08-2014 07:37 AM

Hi,

Suppose that, because of multicollinearity between my variables, I use PCA, derive some components, and use them to fit the model. How can I then transform the coefficients back to get the true effect of the original variables in the model?

Thank you in advance
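
[Editor's sketch] The algebra behind the question can be illustrated outside SAS. The following is a minimal NumPy sketch, not the poster's model: the data, the seed, and the choice of k = 2 components are all made up for illustration. If the components are scores T = Z V_k on the standardized predictors Z, and the model on the scores has slopes gamma, then the implied slopes on the standardized predictors are V_k gamma; dividing by each predictor's standard deviation returns them to the original scale. These coefficients reproduce the component model's predictions exactly. Whether they can be read as the "true effect" of each variable is a separate matter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three strongly correlated predictors and one response.
n = 200
z = rng.normal(size=n)
X = np.column_stack([z + 0.3 * rng.normal(size=n) for _ in range(3)])
y = 1.0 + X @ np.array([0.5, 1.0, -0.2]) + rng.normal(scale=0.5, size=n)

# Standardize the predictors (equivalent to PCA on the correlation matrix).
mu = X.mean(axis=0)
sd = X.std(axis=0, ddof=1)
Z = (X - mu) / sd

# Principal component loadings from the SVD of Z (rows of Vt are components).
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
k = 2                       # number of components kept (arbitrary here)
V_k = Vt[:k].T              # p x k loading matrix
T = Z @ V_k                 # n x k component scores

# Regress y on the component scores, with an intercept.
A = np.column_stack([np.ones(n), T])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
gamma0, gamma = coef[0], coef[1:]

# Back-transform: slopes on standardized X, then on the original scale.
beta_std = V_k @ gamma
beta = beta_std / sd
intercept = gamma0 - mu @ beta

# The component model and the back-transformed model give the same fit.
fitted_scores = A @ coef
fitted_original = intercept + X @ beta
print(np.allclose(fitted_scores, fitted_original))
print(beta)
```

With k equal to the number of predictors, the back-transformed slopes coincide with ordinary least squares; with fewer components they are a constrained version of it.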

SteveDenham · Posted 08-08-2014 08:55 AM

I think you are in an area where the research is still up in the air. Time series on principal components strikes me as a very difficult, but interesting, area. What does PROC PANEL give you (or not give you)? It deals with multivariate time series.

My fear about the PCA approach is that the loadings on the components will differ at each time point, meaning that you aren't really looking at the same "variables". I don't know enough about copulas (and PROC COPULA) to suggest them as an approach, but the theory there is much more developed.

Steve Denham

PaigeMiller · Posted 08-08-2014 08:57 AM

If you want to derive components for a regression model, then PROC PLS will do a better job than Principal Components. The DETAILS option in PROC PLS computes the regression coefficients for you.

You could also use the METHOD=PCR option in PROC PLS to force the procedure to provide Principal Components Regression and related components and model coefficients, but I would not recommend this, as the PLS model (and components) will fit better.

--
Paige Miller

stat_sas · Posted 08-08-2014 09:21 AM

Hi,

I would suggest using PROC VARCLUS to retain the original variables in the model instead of components. This will help to reduce multicollinearity as well as to measure the true effect of the original variables.

PaigeMiller · Posted 08-08-2014 09:42 AM

The problem with PCA, and the problem with VARCLUS, in this situation, is that they find combinations of predictor variables that may or may not be predictive of the response variable(s). PLS specifically tries to find components that are predictive of the response variable(s) and hence will produce better fits.

--
Paige Miller

stat_sas · Posted 08-08-2014 10:14 AM

Thanks PaigeMiller - PLS finds components that are highly correlated with the response variable. What if we want to see the significance of the original variables in the model? Is that possible using PLS?

PaigeMiller · Posted 08-08-2014 10:26 AM

The whole idea of finding the significance of the original variables in a situation where the predictors are multicollinear seems to me to be the wrong question to ask.

You will always be misled in this multicollinear situation by asking which are the "real" or "significant" predictors. It is impossible to tell, empirically.

So PLS doesn't answer the question. It gives you a (hopefully) good predictive model, and in many situations, it also gives interpretable loadings to help you understand what combinations of variables are predictive.

--
Paige Miller

stat_sas · Posted 08-08-2014 10:42 AM

Thanks - I think the question was asked to determine the significance of the original variables in terms of coefficients in the model. Loadings can tell us about the strength of association among factors and variables but cannot measure the significance of a predictor in explaining a response variable. Also, I am not sure what type of rotation is involved in PLS; that would make the interpretation even more complicated.

PaigeMiller · Posted 08-08-2014 10:52 AM

"... but cannot measure the significance of a predictor in explaining a response variable"

Correct, it cannot do this, because logically, in the multicollinear situation, this idea of "significance of a predictor" makes no sense.

"I am not sure what type of rotation is involved in PLS; that would make the interpretation even more complicated."

No rotation is used.

--
Paige Miller

stat_sas · Posted 08-08-2014 12:40 PM

Thanks again!

This is going to be an interesting discussion. Without rotation, if one factor is evenly loaded on two or more variables, how can we decide which of these variables is more predictive?

PaigeMiller · Posted 08-08-2014 04:17 PM

stat@sas wrote:

This is going to be an interesting discussion. Without rotation, if one factor is evenly loaded on two or more variables, how can we decide which of these variables is more predictive?

I keep saying, you can't do this. PLS reports which combinations of variables are predictive. PLS does not attempt to single out an individual variable. Nor should you attempt to single out an individual variable.

Furthermore, in the case of multicollinearity, you cannot logically single out a variable as "more predictive". For example, suppose X1, X2 and X3 all have correlations of about 0.8 with each other, and Y is predicted by X1, X2 and X3. Can you say, using any logical method, that X1 is the variable that is "more predictive" if X2 and X3 are moving together with X1? No, of course not. You may run a statistical procedure that reports slopes and statistical significance, and one of those variables will be the "winner", but that doesn't take into account the logical impossibility of separating three correlated effects into a single "winner". Thus, PLS reports that the combination of X1, X2 and X3 is predictive, and does not single one out. Ordinary Least Squares regression fails miserably in this situation (although the algorithm will certainly produce results).

--
Paige Miller
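
[Editor's sketch] The X1, X2, X3 example above can be simulated to see how arbitrary the "winner" is. This is a small NumPy illustration under assumptions that are not in the thread: all three predictors are given the same true slope, and the sample size, noise level, and seed are made up. Ordinary least squares is fitted on repeated samples and the sketch counts how often each variable ends up with the largest estimated slope. It also computes the variance inflation factors implied by pairwise correlations of 0.8.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pairwise correlation 0.8 among X1, X2, X3, as in the example above.
R = np.full((3, 3), 0.8)
np.fill_diagonal(R, 1.0)
L = np.linalg.cholesky(R)

# Assumption for illustration: all three predictors have the SAME true slope.
true_beta = np.array([1.0, 1.0, 1.0])

n, n_reps = 50, 200
wins = np.zeros(3, dtype=int)
for _ in range(n_reps):
    X = rng.normal(size=(n, 3)) @ L.T          # rows have correlation matrix R
    y = X @ true_beta + rng.normal(scale=3.0, size=n)
    A = np.column_stack([np.ones(n), X])       # intercept plus predictors
    slopes = np.linalg.lstsq(A, y, rcond=None)[0][1:]
    wins[np.argmax(slopes)] += 1               # which variable "wins" this sample

# Variance inflation factors: diagonal of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(R))

print("times each variable had the largest OLS slope:", wins)
print("VIF:", vif)
```

Because the three predictors are exchangeable by construction, any variable that comes out on top in a given sample does so through sampling noise alone, which is the point being made above.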

stat_sas · Posted 08-08-2014 04:26 PM

Thanks PaigeMiller.

tyang · Posted 02-16-2015 05:18 AM

Hello! I found this discussion very interesting and am trying to apply PROC PLS because it is so useful in using combinations of variables which are multicollinear. The issue I am coming up against is that I am unable to figure out how to extract the "factors" to use in a regression model. Whereas PROC FACTOR can be used with PROC SCORE to extract the factors, there doesn't seem to be an analogous procedure that works with PROC PLS.

PaigeMiller · Posted 02-17-2015 12:12 PM

I'm not sure exactly what you mean by "how to extract the factors". Can you explain further what these "factors" are (since "factors" is not really a term used in PLS)?

If by "factors" you mean "loadings", then it's easy to obtain those from PROC PLS. If you mean something else, then you need to explain what you mean.

--
Paige Miller

tyang · Posted 02-20-2015 06:36 AM

The "factors" I am interested in are the "Number of Extracted Factors" that is displayed as part of the "Percent Variation Accounted for by Partial Least Squares Factors".

With PROC FACTOR, you can OUTSTAT a dataset which can be used by PROC SCORE to generate "factors" to use in a regression analysis. I want to do the same in terms of extracting these "factors" from PROC PLS.
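
[Editor's sketch] What is being asked for here appears to be the per-observation scores on the extracted PLS factors. In SAS, the OUTPUT statement of PROC PLS with the XSCORE= keyword is, as far as I know, the place to look for these; check that against the PROC PLS documentation. The NumPy sketch below only illustrates what such scores are and how they feed a regression. It is a plain NIPALS implementation for a single response, written for illustration and not taken from SAS. The function name pls1_scores and the data are hypothetical, and the predictors are centered but not scaled, which may differ from the SAS default.

```python
import numpy as np

def pls1_scores(X, y, n_factors):
    """NIPALS PLS for one response: returns X-scores T, weights W,
    X-loadings P and y-loadings q. X and y are centered internally."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    n, p = Xk.shape
    T = np.zeros((n, n_factors))
    W = np.zeros((p, n_factors))
    P = np.zeros((p, n_factors))
    q = np.zeros(n_factors)
    for a in range(n_factors):
        w = Xk.T @ yk                  # direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                     # scores on this factor
        tt = t @ t
        p_a = Xk.T @ t / tt            # X-loadings
        q[a] = yk @ t / tt             # y-loading (slope of y on t)
        Xk = Xk - np.outer(t, p_a)     # deflate X
        yk = yk - q[a] * t             # deflate y
        T[:, a], W[:, a], P[:, a] = t, w, p_a
    return T, W, P, q

# Hypothetical data: four correlated predictors, one response.
rng = np.random.default_rng(2)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.4 * rng.normal(size=n) for _ in range(4)])
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(size=n)

T, W, P, q = pls1_scores(X, y, n_factors=2)

# Using the scores as regressors: the score columns are orthogonal,
# so regressing centered y on T recovers the y-loadings q.
gamma = np.linalg.lstsq(T, y - y.mean(), rcond=None)[0]

# Implied coefficients on the centered original variables.
B = W @ np.linalg.solve(P.T @ W, q)
print(gamma, q)
print(B)
```

The last step is the PLS counterpart of turning component coefficients back into coefficients on the original variables: T @ q and the centered X times B give the same fitted values.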
