Hi - I have a data set with both continuous and categorical variables, and I am trying to figure out if I can remove any of them due to multicollinearity. I have tried using PROC PLS with both the PLS and PCR methods, but I am not sure if this is the right way to go about it.
proc pls data=Vehicle details method=pls;
class number_of_cylinders drive_wheels vehicle_type;
model msrp=horsepower mpg_hwy weight wheelbase width drive_wheels/solution;
run;
My output shows not much of a change in the R-square for the dependent variable at 3 extracted factors.
So then I ran:
proc pls data=Vehicle nfac=3 details method=pls;
class number_of_cylinders drive_wheels vehicle_type;
model msrp=horsepower mpg_hwy weight wheelbase width drive_wheels/solution;
run;
I'm not sure at this point what to do with the output and how to determine which factors I can remove from the model. Any help would be appreciated! This is my first course using SAS so if I'm completely off base, please point me in the right direction, thanks.
In general, PROC PLS fits a model in such a way that it is robust to the effects of multi-collinearity, and so removing variables because of multi-collinearity is not necessary (although you can do that if you want to). See this article where a PLS model is fit on 1000 highly correlated variables and no variables are removed from the model, and the model still works very well, both in terms of predictability and in terms of interpreting the factors.
I'm not sure at this point what to do with the output and how to determine which factors I can remove from the model.
I thought you wanted to remove variables, not factors.
What to do with the output depends on what your goals are for this modeling, which you haven't told us. Do you want to get predicted values? Do you want to understand which variables are important in the predictions? Do you want to interpret the PLS factors?
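As a rough sketch of how each of those goals could map onto PROC PLS, the call below reuses the data set and model from the question. The output data set name pls_out and the variable name msrp_hat are made up for illustration, and the plot requests assume ODS Graphics is available.

```sas
ods graphics on;

/* Sketch only: pls_out and msrp_hat are hypothetical names */
proc pls data=Vehicle nfac=3 method=pls plots=(vip xloadingprofiles);
   class drive_wheels;
   model msrp = horsepower mpg_hwy weight wheelbase width drive_wheels / solution;
   /* predicted values of msrp are written to pls_out */
   output out=pls_out predicted=msrp_hat;
run;
```

The VIP plot speaks to which variables are important, the X-loading profiles help with interpreting the factors, and the OUTPUT statement supplies the predicted values.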
@hovicke wrote:
Thanks, Paige. I guess I was confused about what proc pls actually does and did not realize variables did not have to be removed due to the multicollinearity. The objective was to find the “best” model from a set of data. I used proc glmselect for model selection and then wanted to test for multicollinearity using proc pls due to having both continuous and categorical variables and not being able to check vif with proc reg.
If I did want to remove variables using the proc pls output, how would I go about doing that? Or should that not even be done to fit a “best” model due to what proc pls does?
The idea of finding a "best" model is one that isn't really defined. If you fit a lot of models to a set of data, you could choose a "best" one according to some criterion, but it may not be best under some other criterion. There are lots of criteria you could possibly use. One of them isn't actually available in SAS, but a study showed that PLS produces lower mean square error on its predicted values and regression coefficients than the other methods in the study, meaning the model is more robust and stable.
In my opinion, the whole idea of using PLS to figure out which variables to remove from PROC REG is backwards. You don't run PLS to fix the multicollinearity problems in PROC REG. You run PLS to get a model that is robust to multicollinearity and that you can use instead of the PROC REG model.
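If the PLS model is going to stand in for the PROC REG model, one way to settle on the number of factors is cross validation rather than judging the R-square table by eye. A minimal sketch, assuming leave-one-out cross validation is acceptable for this data; the seed value is arbitrary:

```sas
/* Sketch only: cv=one requests leave-one-out cross validation;
   cvtest compares each candidate number of factors against the
   one with the minimum PRESS (seed chosen arbitrarily) */
proc pls data=Vehicle method=pls cv=one cvtest(seed=12345);
   class drive_wheels;
   model msrp = horsepower mpg_hwy weight wheelbase width drive_wheels / solution;
run;
```

With these options, PROC PLS keeps the smallest number of factors whose PRESS is not significantly larger than the minimum and fits the final model with that many factors.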