Multicollinearity in Multiple Linear Regression with Continuous and Categorical Variables

hovicke · Posted 11-14-2022 05:27 PM

Hi - I have a data set with both continuous and categorical variables, and I am trying to figure out whether I can remove any of them due to multicollinearity. I have tried PROC PLS with both the PLS and PCR methods, but I am not sure if this is the right way to go about it.

proc pls data=Vehicle details method=pls;
class number_of_cylinders drive_wheels vehicle_type;
model msrp=horsepower mpg_hwy weight wheelbase width drive_wheels/solution;
run;

My output shows that the R-square for the dependent variable stops changing much at 3 extracted factors.

So then I ran:

proc pls data=Vehicle nfac=3 details method=pls;
class number_of_cylinders drive_wheels vehicle_type;
model msrp=horsepower mpg_hwy weight wheelbase width drive_wheels/solution;
run;

I'm not sure at this point what to do with the output and how to determine which factors I can remove from the model. Any help would be appreciated! This is my first course using SAS so if I'm completely off base, please point me in the right direction, thanks.

PaigeMiller · Posted 11-15-2022 08:42 AM

In general, PROC PLS fits a model in such a way that it is robust to the effects of multi-collinearity, and so removing variables because of multi-collinearity is not necessary (although you can do that if you want to). See this article where a PLS model is fit on 1000 highly correlated variables and no variables are removed from the model, and the model still works very well, both in terms of predictability and in terms of interpreting the factors.

@hovicke wrote:
I'm not sure at this point what to do with the output and how to determine which factors I can remove from the model.

I thought you wanted to remove variables, not factors.

What to do with the output depends on your goals for this modeling, which you haven't told us. Do you want to get predicted values? Do you want to understand which variables are important in the predictions? Do you want to interpret the PLS factors?

--
Paige Miller

hovicke · Posted 11-15-2022 09:02 AM

Thanks, Paige. I guess I was confused about what proc pls actually does and did not realize variables did not have to be removed due to the multicollinearity. The objective was to find the “best” model from a set of data. I used proc glmselect for model selection and then wanted to test for multicollinearity using proc pls due to having both continuous and categorical variables and not being able to check vif with proc reg.

If I did want to remove variables using the proc pls output, how would I go about doing that? Or should that not even be done to fit a “best” model due to what proc pls does?
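For readers with the same question about VIFs: one common workaround, sketched here and not something tested on this data, is to let PROC GLMSELECT expand the categorical variable into dummy columns and then pass those columns to PROC REG. The data set and variable names come from the code earlier in the thread; the output data set name Work.VehicleDesign is made up for illustration, and reference-cell coding (param=ref) is an assumption chosen so the design matrix is full rank.

```sas
/* Sketch: build a design matrix that includes dummy columns for the   */
/* CLASS variable. selection=none keeps every effect in the model.     */
/* Work.VehicleDesign is a hypothetical output data set name.          */
proc glmselect data=Vehicle
               outdesign(addinputvars)=Work.VehicleDesign noprint;
   class drive_wheels / param=ref;   /* assumption: reference coding   */
   model msrp = horsepower mpg_hwy weight wheelbase width drive_wheels
         / selection=none;
run;

/* PROC GLMSELECT stores the design-column names in the macro variable */
/* _GLSMOD, so PROC REG can request VIF and tolerance for each column. */
proc reg data=Work.VehicleDesign plots=none;
   model msrp = &_GLSMOD / vif tol;
run;
quit;
```

Note that the VIFs are reported per dummy column, so a categorical variable with several levels gets several values rather than one, and they should be read with that in mind.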

PaigeMiller · Posted 11-15-2022 09:37 AM

@hovicke wrote:
Thanks, Paige. I guess I was confused about what proc pls actually does and did not realize variables did not have to be removed due to the multicollinearity. The objective was to find the “best” model from a set of data. I used proc glmselect for model selection and then wanted to test for multicollinearity using proc pls due to having both continuous and categorical variables and not being able to check vif with proc reg.

If I did want to remove variables using the proc pls output, how would I go about doing that? Or should that not even be done to fit a “best” model due to what proc pls does?

The idea of a "best" model isn't really well defined. If you fit many models to a set of data, you could choose a "best" one according to some criterion, but it may not be best under another criterion. There are lots of criteria you could use. One criterion isn't actually available in SAS, but a study showed that PLS produces lower mean square error on its predicted values and regression coefficients than the other methods in the study, meaning the model is more robust and stable.

In my opinion, the whole idea of using PLS to figure out which variables to remove from PROC REG is backwards. You don't run PLS to fix the multicollinearity problems in PROC REG. You run PLS to get a model that is robust to multicollinearity and that you can use instead of the PROC REG model.
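As a rough illustration of using the PLS model itself, the sketch below lets cross validation choose the number of factors instead of fixing nfac=3 by eye, and saves predicted values. It reuses the data set and variables from the original post. The seed value, the output data set Work.VehiclePred, and the variable name msrp_hat are arbitrary choices for illustration, and leave-one-out cross validation (cv=one) is only one of several options.

```sas
/* Sketch: choose the number of PLS factors by cross validation.       */
/* cv=one is leave-one-out; cvtest runs a randomization test so that   */
/* the smallest model not significantly worse than the best is chosen. */
ods graphics on;
proc pls data=Vehicle method=pls cv=one cvtest(seed=12345)
         plots=(vip parmprofiles);   /* variable importance, coefficient profiles */
   class drive_wheels;
   model msrp = horsepower mpg_hwy weight wheelbase width drive_wheels
         / solution;
   /* Hypothetical names: predictions are written to Work.VehiclePred  */
   output out=Work.VehiclePred predicted=msrp_hat;
run;
```

The variable importance plot is one way to see which predictors contribute little, if the goal is interpretation. The predicted values in the output data set cover the case where the goal is prediction.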

--
Paige Miller
