I am running a multiple linear regression model with 8 covariates, 4 of which are highly correlated (r > 0.7). So I created z-scores and then created a composite. When I re-ran the model with this composite, my predictor's p-value became considerably larger and my R² went down. Why is this happening? I thought p-values decreased after accounting for multicollinearity.
It's about more than just the p-value.
An excerpt from Wikipedia that's relevant here:
So long as the underlying specification is correct, multicollinearity does not actually bias results; it just produces large standard errors in the related independent variables. More importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. Since multicollinearity causes imprecise estimates of coefficient values, the resulting out-of-sample predictions will also be imprecise. And if the pattern of multicollinearity in the new data differs from that in the data that was fitted, such extrapolation may introduce large errors in the predictions.
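To make "large standard errors" concrete, the standard textbook expression for the sampling variance of an ordinary least squares coefficient can be written with the variance inflation factor (VIF) as a separate term. The symbols here are notation introduced for illustration, not taken from the thread: \sigma^2 is the error variance, s_j^2 the sample variance of predictor x_j, and R_j^2 the R² from regressing x_j on all the other predictors in the model.

```latex
\[
\operatorname{Var}(\hat\beta_j)
  = \frac{\sigma^2}{(n-1)\,s_j^2}\cdot\frac{1}{1-R_j^2},
\qquad
\mathrm{VIF}_j = \frac{1}{1-R_j^2}
\]
% Illustration: with only two predictors correlated at r = 0.7,
% R_j^2 = 0.49, so VIF_j = 1/0.51, about 1.96, and the standard
% error is inflated by sqrt(1.96), about 1.4.
```

Note that the inflation applies to each coefficient through its own R_j^2. If the predictor of interest is only weakly related to the four correlated covariates, its own VIF may be small, and removing the collinearity among those covariates would not be expected to shrink its p-value much.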
I'm not 100% sure what you mean by "I created z scores and then created a composite", but whatever it means, it could be that the new variable you are using to account for the multicollinearity is not as predictive of the response variable as the original variables are.
In any event, in the presence of multicollinearity, I always recommend using Partial Least Squares regression (PROC PLS) instead of ordinary least squares regression. Partial Least Squares is generally less affected by multicollinearity, and it yields model coefficients with less variability (lower mean squared error) and predicted values with lower mean squared error than you would get using ordinary least squares. See http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1993.10485033.
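As a rough illustration of the PROC PLS suggestion, here is a minimal sketch. The data set name work.study and the variable names (outcome y, predictor of interest x, correlated questionnaire scores q1-q4, remaining covariates c5-c8) are hypothetical placeholders, not names from the original question.

```sas
/* Minimal sketch; data set and variable names are hypothetical. */
proc pls data=work.study method=pls cv=one;
   /* CV=ONE requests leave-one-out cross validation to choose
      the number of extracted factors.                          */
   /* SOLUTION prints the coefficients of the final model.      */
   model y = x q1-q4 c5-c8 / solution;
run;
```

The original, unaveraged questionnaire variables go into the MODEL statement here, so no composite has to be built by hand.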
Also, I agree 100% with @Reeza's quote from Wikipedia.
Well, from what I remember from a course, if you have multicollinearity among some covariates (these are questionnaires), you can convert them to z-scores and then average them into one variable. Is this not correct?
No, what you're referring to is standardization, which puts all variables on the same scale. It prevents variables that are larger in magnitude from being too influential in the model.
But doesn't it still address multicollinearity?
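Not really; standardized variables can still be correlated. Converting to z-scores is a linear rescaling, so the pairwise correlations are exactly what they were before.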
Yes, the standardization itself did not address the collinearity, but because of the standardization, a composite can be calculated, which will address the collinearity, from my understanding.
When you say composite, are you talking about a principal component or eigenvector? If so, then yes, the eigenvectors are orthogonal by definition, so the resulting components are uncorrelated with one another. Also, not all of the components are used in the model, which helps as well.
Just because they're not correlated with each other doesn't mean they'll correlate with the dependent variable, though. Your initial assumption that you'd get a 'better' model is based on a comparison of the p-values, but as we mentioned earlier, it's not just about p-values.
Sure, there is less (or no) collinearity if you replace four correlated variables with one "composite" variable, but this is somewhat meaningless as you don't know how the original 4 variables can be used to predict the output. And as has been stated, this new "composite" variable may not be a good predictor.
Overall, I'd say this is not an approach I would recommend in this case.
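One way to see why R² went down: a model that uses the average of the four z-scores is a restricted version of the model that uses the four variables separately. The notation below is introduced for illustration only (x is the predictor of interest, z_1 to z_4 are the standardized questionnaire scores, c is their average, and "..." stands for the remaining covariates, which are the same in both models).

```latex
\[
\begin{aligned}
\text{Full model:}\quad
  & y = \beta_0 + \alpha x + \sum_{j=1}^{4}\beta_j z_j + \dots + \varepsilon \\
\text{Composite:}\quad
  & c = \tfrac{1}{4}\sum_{j=1}^{4} z_j \\
\text{Composite model:}\quad
  & y = \beta_0 + \alpha x + \delta c + \dots + \varepsilon
    = \beta_0 + \alpha x + \sum_{j=1}^{4}\frac{\delta}{4}\, z_j + \dots + \varepsilon
\end{aligned}
\]
% The composite model is the full model under the constraint
% beta_1 = beta_2 = beta_3 = beta_4 (= delta / 4).
\[
\mathrm{SSE}_{\text{composite}} \ge \mathrm{SSE}_{\text{full}}
\quad\Longrightarrow\quad
R^2_{\text{composite}} \le R^2_{\text{full}}
\]
```

Because least squares on the full model already searches over every choice of the four coefficients, forcing them to be equal cannot improve the fit. Standardizing alone does not change R² (it is a linear rescaling and the model has an intercept), so the drop comes from the equal-weights constraint. A larger residual variance also feeds into every standard error in the model, which is one plausible reason the predictor's p-value grew rather than shrank.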
Did you check the variance inflation factor (VIF)? In PROC REG, add the VIF option to the MODEL statement: proc reg; model ......... / vif; run;
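A fuller sketch of that check, comparing the original model with the composite model, might look like the following. All data set and variable names (work.study, y, x, q1-q4, c5-c8, comp) are hypothetical placeholders, and the composite is built the way it was described in the thread (z-scores, then a simple average), which is an assumption about what was actually done.

```sas
/* Minimal sketch; data set and variable names are hypothetical. */

/* 1. Original model: VIF and tolerance for every term. */
proc reg data=work.study;
   model y = x q1-q4 c5-c8 / vif tol;
run;
quit;

/* 2. Standardize the four correlated variables to z-scores. */
proc standard data=work.study mean=0 std=1 out=work.study_z;
   var q1-q4;
run;

/* 3. Average the z-scores into one composite.
      MEAN() ignores missing values, so a row with some missing
      scores still gets a composite from the remaining ones.    */
data work.study_comp;
   set work.study_z;
   comp = mean(of q1-q4);
run;

/* 4. Composite model: compare the VIF and standard error of x
      with step 1.                                              */
proc reg data=work.study_comp;
   model y = x comp c5-c8 / vif tol;
run;
quit;
```

Common rules of thumb treat a VIF above roughly 5 to 10 as a sign of problematic collinearity. The number to watch is the VIF of x itself: if it was already low in step 1, the composite had little room to improve the p-value of x.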