LucyB
Obsidian | Level 7

I am running a multiple linear regression model and I have 8 covariates, 4 of them are highly correlated (r>0.7). So I created z scores and then created a composite. When I re-ran the model with this composite, my predictor p-value became significantly larger and my R2 went down. Why is this happening? I thought p-values decreased after accounting for multicollinearity?

11 REPLIES
Reeza
Super User

It's more than just the p-value.

 

An excerpt from Wikipedia that's relevant here:

 

So long as the underlying specification is correct, multicollinearity does not actually bias results; it just produces large standard errors in the related independent variables. More importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. Since multicollinearity causes imprecise estimates of coefficient values, the resulting out-of-sample predictions will also be imprecise. And if the pattern of multicollinearity in the new data differs from that in the data that was fitted, such extrapolation may introduce large errors in the predictions.

PaigeMiller
Diamond | Level 26

@LucyB wrote:

I am running a multiple linear regression model and I have 8 covariates, 4 of them are highly correlated (r>0.7). So I created z scores and then created a composite. When I re-ran the model with this composite, my predictor p-value became significantly larger and my R2 went down. Why is this happening? I thought p-values decreased after accounting for multicollinearity?


I'm not 100% sure what you mean by "I created z scores and then created a composite", but whatever this means, it could be that the new variables you are using to account for the multi-collinearity are not as predictive of the response variable as the original variables are.

 

In any event, in the presence of multicollinearity, I always recommend using Partial Least Squares regression (PROC PLS) instead of ordinary least squares regression. Partial Least Squares generally is less affected by multicollinearity, and it produces model coefficients and predicted values with lower mean squared error than you would get from ordinary least squares. See http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1993.10485033.
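A minimal PROC PLS sketch, assuming a hypothetical dataset HAVE with response Y and the eight predictors X1-X8 (names are placeholders, not from this thread):

proc pls data=have cv=one cvtest;     /* leave-one-out cross-validation picks the number of PLS factors */
   model y = x1-x8 / solution;        /* SOLUTION prints the final regression coefficients */
run;

The cross-validation options choose how many latent factors to keep, which is where PLS gets its resistance to collinearity.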

 

Also, I agree 100% with @Reeza's quote from Wikipedia.

--
Paige Miller
LucyB
Obsidian | Level 7

Well, from what I remember from a course, if you have multicollinearity among some covariates (these are questionnaires), you can convert them to z scores and then average them into one variable. Is this not correct?

Reeza
Super User

@LucyB wrote:

Well, from what I remember from a course, if you have multicollinearity among some covariates (these are questionnaires), you can convert them to z scores and then average them into one variable. Is this not correct?


No, what you're referring to is standardization, which puts all variables on the same scale. It prevents variables that are measured on larger scales from being overly influential in the model.
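For reference, a minimal standardization sketch, assuming the four questionnaire scores are named Q1-Q4 in a dataset HAVE (placeholder names):

proc standard data=have mean=0 std=1 out=z_scores;
   var q1-q4;     /* each variable is replaced by its z score (mean 0, SD 1) in the output dataset */
run;

This changes the scale of each variable but not the correlations among them.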

LucyB
Obsidian | Level 7

But doesn't it still address multicollinearity?

LucyB
Obsidian | Level 7

Yes, the standardization itself did not address the collinearity, but because of the standardization, a composite can be calculated, which will address the collinearity, from my understanding.
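For concreteness, the kind of composite being described might be computed like this (a sketch continuing from the hypothetical Z_SCORES dataset above; names are assumptions):

data composite;
   set z_scores;                   /* output of the standardization step */
   composite = mean(of q1-q4);     /* average of the four standardized items, used as a single predictor */
run;

The composite then replaces the four correlated items in the regression, which removes the collinearity among them but does not guarantee it predicts the response as well.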

Reeza
Super User

When you say composite, are you talking about a principal component or eigenvector? Then yes, the eigenvectors are by definition orthogonal and independent. But not all of the eigenvectors are used in the model, which also helps.
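If that is the intent, a principal-component sketch along these lines would produce orthogonal scores (dataset and variable names are placeholders):

proc princomp data=have out=pc_scores n=2;   /* keeps the first two component scores as Prin1 and Prin2 */
   var q1-q4;
run;

The component scores (Prin1, Prin2, ...) could then stand in for the correlated items as predictors.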

Reeza
Super User

Just because they're not correlated with each other doesn't mean they'll correlate with the dependent variable either... your initial assumption that you'd get a 'better' model is based on a comparison of p-values, but as we mentioned earlier, it's not just about p-values.

PaigeMiller
Diamond | Level 26

@LucyB wrote:

Yes- the standardization itself did not address the collinearity, but because of the standardization, a composite can be calculated, which will address the collinearity from my understanding.


Sure, there is less (or no) collinearity if you replace four correlated variables with one "composite" variable, but this is somewhat meaningless as you don't know how the original 4 variables can be used to predict the output. And as has been stated, this new "composite" variable may not be a good predictor.

 

Overall, I'd say this is not an approach I would recommend in this case.

--
Paige Miller
Ksharp
Super User
Did you check the Variance Inflation Factor?

proc reg;
   model ......... / vif;   /* the VIF option requests variance inflation factors for each predictor */
run;



