Collinearity, also called multicollinearity, refers to strong linear associations among sets of predictors. In regression models, these associations can inflate standard errors, destabilize parameter estimates, and reduce model interpretability. In this post, I’ll explain what collinearity is, when it matters, how to detect it, and strategies to mitigate its undesirable effects. In a follow-up post, I’ll show examples of variable reduction approaches for reducing collinearity prior to modeling.
What is collinearity?
Let’s go through the parts of the above definition for collinearity. Linear association means that as one variable increases, the other changes as well at a relatively constant rate. When collinear variables are graphed together, they tend to fall on a straight line. The term predictors in the definition tells us that collinearity is an unsupervised concept. It does not involve a response variable, only the predictors. And sets of predictors tells us that collinearity is not limited to associations between only two variables.
Collinearity can cause a variety of problems as described later in this post. It is worth noting that collinearity is not a violation of the assumptions of regression models (i.e., ε ~ i.i.d. N(0, σ²)). Regardless, collinearity should be assessed along with model assumptions as its presence can also cause modeling problems.
What causes collinearity?
Essentially all real-world data has some degree of collinearity among predictors. These associations exist for a variety of reasons. A common cause of collinearity is having predictors that measure the same underlying thing (sometimes called a latent variable). For example, measures of runners’ best 5K race times, resting heart rates, VO₂ max, etc. are likely all correlated because they measure the same underlying (and difficult to directly measure) variable, cardiovascular fitness. Sometimes the predictors measure percentages of a whole, such as percent of income spent on housing and percent of income spent on non-housing expenses. Predictors like these that add up to 100% will necessarily be correlated since increasing one requires decreasing others. A third reason for collinearity is dumb luck. It may be that the sample you are analyzing has restricted ranges for several variables that result in linear associations. These associations might not exist in other samples from the population. This has implications for predictive modelers: the patterns of collinearity in the training and validation data might differ from those in the data the model will eventually score, and predictive models have reduced performance when the patterns of collinearity change between model development and scoring.
What are the effects of collinearity on explanatory regression models?
Collinearity inflates the variance of parameter estimates, making it less likely to find significant effects, even for important predictors. It also makes it harder to get good parameter estimates due to instability, which is reflected in the increased width of the confidence intervals. Instability here means that small changes in the development data can result in large changes in the magnitude of the estimates and even changes in their sign. If the sign of a parameter is opposite to what is expected based on theory, explanation and interpretation become difficult.
Partial regression coefficients are attempts at estimating the effect of one predictor while holding all other predictors constant. But with strong collinearity, the data doesn’t show the effects of one predictor across the full range of the other predictors. So, the coefficients are trying to estimate something beyond what the data shows. Because of this, there are multiple possible ways to assign variance in Y to the predictors that fit the data almost equally well. This is why the standard errors of the parameter estimates are so large.
What are the effects of collinearity on predictive regression models?
Collinearity is generally more of a problem for explanatory modeling than predictive modeling. It will not reduce the predictive power of the model overall, but it will affect estimates of the individual parameters. Sometimes a highly significant model will have all non-significant individual predictors due to collinearity. These models can still make reliable predictions. Collinearity also inflates the variances of the predicted values. This can be especially severe if the values of X’s used to make predictions are outside the range of the development data.
Additionally, collinearity causes automated variable selection methods to perform poorly. Stepwise methods (forward, backward, and stepwise selection) can result in an essentially random choice of which of the collinear variables enter the model. The specific variables that enter and the order of entry can alter the trajectory of the variable selection process. This may prevent the process from finding the set of predictors that produces the lowest validation error.
The problems of unstable magnitude or sign can sometimes be important for predictive modelers too. If the model needs to be explained or interpreted, having theoretically implausible parameters will reduce the explainability of the model.
Detecting collinearity
As mentioned previously, a significant overall model with all non-significant predictors is usually a sign of collinearity. But typically, it is not obvious if collinearity exists in a particular data set. Three common approaches for detecting collinearity are calculation of Pearson correlation coefficients, variance inflation factors, and condition index values.
Correlation coefficients are a good first approach but should not be the sole method for assessing collinearity. This is because correlations cannot identify collinearity that exists among 3 or more predictors. For example, imagine a data set with test scores for a high school class: Exam 1 (E1), Exam 2 (E2) and Final Grade (F), with F calculated as the mean of E1 and E2. The pairwise associations among E1, E2, and F might be unremarkable, yet the data have perfect collinearity: knowing any two of these three variables gives you perfect information about the third.
A correlation coefficient of r = 0.8–0.9 or higher can indicate a collinearity problem. The SAS 9 procedure PROC CORR and the SAS Viya procedure PROC CORRELATION can calculate Pearson correlation coefficients. The Pearson correlation, r, is calculated as follows:
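For two variables x and y with sample means x̄ and ȳ:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}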
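As a minimal sketch (the data set and variable names are placeholders), pairwise Pearson correlations for a set of candidate predictors can be requested with:

proc corr data=have pearson;
   var x1 x2 x3;
run;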
Variance inflation factors (VIF) can be calculated by regressing each predictor on the set of other predictors to be used in a model and plugging the resulting coefficient of determination (R²) into the following equation:
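For predictor j, where R²_j comes from regressing predictor j on the remaining predictors:

VIF_j = \frac{1}{1 - R_j^2}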
Each predictor will have its own VIF. So if regression of a predictor on the remaining predictors produces R² = 0.9, this corresponds to a variance inflation factor of 10. VIF=10 is considered an indication of a serious collinearity problem that should be addressed. Some researchers consider VIF>8 to be problematic. Variables with high VIF can be considered to be causing the collinearity problem. As the name suggests, VIF describes the factor by which the parameter’s variance is inflated compared to a model with no collinearity.
VIF can be computed using the VIF option in the MODEL statement of either PROC REGSELECT (SAS Viya) or PROC REG (SAS 9). An example of calculating VIF with PROC REG for the variables in SASHELP.IRIS shows that petal length and petal width are collinear.
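Here is a sketch of the kind of call involved, with a placeholder response y since the exact model is not shown here (any response works, because VIF is computed from the predictors only):

data iris_vif;
   set sashelp.iris;
   y = rand("uniform");   /* placeholder response; VIF depends only on the X's */
run;

proc reg data=iris_vif;
   model y = sepallength sepalwidth petallength petalwidth / vif;
run;
quit;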
This is not surprising as these variables have a strong correlation of r=0.96. If petal length is dropped, all remaining VIF scores become < 4. While these data contain two collinear variables that could have been detected with a simple correlation, VIF has the advantage of detecting relationships among three or more predictors.
Condition index values involve principal component analysis (PCA). For a brief description of PCA including a discussion of determining the number of components to use please see my previous post: How many principal components should I keep? Part 1: common approaches. Calculating condition index values requires computing the eigenvectors (i.e., principal components, PCs) and eigenvalues (variances of the PCs) of the sums of squares and cross products matrix X′X, with X being the design matrix of predictors. The condition index values are the square roots of the ratio of the largest eigenvalue to each eigenvalue. Since the last eigenvalue will always be the smallest, it will have the largest condition index value.
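With the eigenvalues ordered λ₁ ≥ λ₂ ≥ … ≥ λ_p, the condition index of the j-th component is:

CI_j = \sqrt{\frac{\lambda_1}{\lambda_j}}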
Here is an illustration to help think about condition index values. On the left is correlated data showing heights and weights for a group of people, while on the right is relatively uncorrelated data showing heights and incomes of the same people. For each data set, PCA finds PC1, a new variable that is the linear combination of the two variables with the greatest variance. A second new variable, PC2, is created so that it explains the second greatest proportion of variation in the original variables and is perpendicular to PC1. The eigenvalues (variances) of PC1 and PC2 are λ₁ and λ₂ respectively.
In the correlated data, λ₁ is much larger than λ₂, so the square root of λ₁/λ₂ (the condition index) will be a large number. A large value indicates that the data are spread along a line, with a long direction and a narrow direction in space.
Compare this with the uncorrelated data, which looks more like a ball than a line. Here λ₁ and λ₂ have similar magnitudes, so the condition index will be close to 1. A small condition index indicates no collinearity, with the data points spreading out in all directions. Again, although demonstrated with two variables, this approach has the advantage of being applicable to three or more predictors.
Condition index values can be calculated by using the COLLIN and COLLINOINT options in the MODEL statement in PROC REG. COLLIN calculates condition index values for each column of the design matrix X, including the intercept column. The COLLINOINT option removes the intercept from the calculations and will report one fewer condition index. Some authors suggest only using COLLIN if the intercept has physical reality, in the sense that it is a possible value within the range of the data; otherwise, COLLINOINT is recommended.
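Continuing the iris sketch from above (same placeholder response, which again does not matter because the condition indices depend only on the design matrix), the options are simply added to the MODEL statement:

proc reg data=iris_vif;
   model y = sepallength sepalwidth petallength petalwidth / collin collinoint;
run;
quit;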
Some authors recommend condition index (CI) values > 100 as indicating a severe collinearity problem; others recommend using CI > 30. Bear in mind that these cutoffs are as arbitrary as using p<0.05 to assess significance. Once principal components with high condition index values have been found, variables that have more than half of their variance associated with those components can be considered to be causing the collinearity problem.
Below is an example of the COLLINOINT table from PROC REG for the variables in SASHELP.IRIS. It shows the 4th PC (last row) has a condition index of 11, and 99% and 82% of the variability of petal length and petal width, respectively, are associated with this fourth principal component. A CI of 11 is well under the commonly used threshold for concern of 100. Condition index values excluding the intercept tend to be lower than when the intercept is included; the condition index for the 5th PC when the intercept was included was 51.
How can we check for collinearity involving categorical predictors?
The SAS Viya procedure REGSELECT can model categorical predictors as well as continuous predictors. The VIF option in the MODEL statement will calculate variance inflation factors for all the parameters including the dummy variables that are created for each categorical predictor.
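A minimal sketch, assuming a CAS table named mycas.have with a categorical predictor c1 and continuous predictors x1 and x2 (all placeholder names):

proc regselect data=mycas.have;
   class c1;
   model y = c1 x1 x2 / vif;
run;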
The SAS 9 procedure PROC GLMSELECT can export the design matrix X, which includes dummy variables for any categorical predictors. These data can be read into PROC REG which can be used for calculating VIF and condition index values for the categorical and continuous predictors.
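Here is a sketch of that workflow with placeholder names (have, y, c1, x1, x2). PROC GLMSELECT writes the dummy-coded design matrix to a data set and stores the column names in the macro variable _GLSMOD, which PROC REG can then reference:

proc glmselect data=have outdesign(addinputvars)=design noprint;
   class c1;
   model y = c1 x1 x2 / selection=none;
run;

proc reg data=design;
   model y = &_GLSMOD / vif collinoint;
run;
quit;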
Dealing with collinearity
Many of the methods for dealing with collinearity are attempts to reduce the variances of the parameter estimates. Most of these involve either variable selection (removing redundant variables) or modified parameter estimates (e.g., biased regression methods). Both types of approaches can be applied at the same time. Here are six approaches to consider for your data.
Increase sample size
Often data collection is done and finalized before analysis, so this may not be an option for many people. But if possible, collecting more data will likely reduce the standard errors of the parameters, which is the main problem caused by collinearity. Further, if the collinearity is due to the dumb luck of the sample having limited ranges of predictors, the collinearity may disappear in a larger sample. Increasing the sample size will usually decrease standard errors and make it less likely that results are some sort of sampling fluke.
Remove predictors
Variables with high VIFs (or a set of dummy variables for a categorical predictor) can be dropped one at a time until VIF or condition indices go below your chosen threshold. These can be considered redundant variables. Keep in mind that for explanatory modeling, dropping variables is equivalent to assuming the coefficients are zero for these predictors. Variable reduction can be done prior to model fitting using the SAS Viya procedure PROC VARREDUCE. I will describe this and other variable reduction approaches in my next post.
Recoding variables
Sometimes collinear variables can be combined into a single new predictor that will have lower variance. A related approach is to run a factor analysis on collinear predictors and to use the resulting factor scores as predictors. If combining variables does not make sense, another strategy is to recode the predictors to reduce their correlation. For example, imagine data in which the variables garbage production and power consumption in households are highly correlated. The researcher believes that the correlation exists because households with more people will have higher values of both. They can try redefining these variables as garbage production per person and power usage per person in the household to break the correlation. Hopefully the modified variables are similar enough to the originals to answer the same research questions.
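As a sketch with hypothetical variable names (garbage, power, n_people), the recode is a simple DATA step:

data per_person;
   set have;                          /* placeholder input data set    */
   garbage_pp = garbage / n_people;   /* garbage production per person */
   power_pp = power / n_people;       /* power usage per person        */
run;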
Biased regression techniques
What if all the predictors are theoretically important and none can be reasonably thrown out solely to reduce standard errors? In this case, biased regression techniques such as principal component (PCA) regression or ridge regression can be used, as they retain all the predictors. PCA produces as many new variables as there are original variables, and these new variables are perfectly uncorrelated, meaning there is no collinearity among them. These uncorrelated derived variables are the predictors in PCA regression. Ridge regression shrinks the regression coefficients towards zero, which decreases the variability of the model’s predictions. Shrinkage introduces a small amount of bias in exchange for a larger reduction in variance, per the bias-variance trade-off. For a review of the bias-variance trade-off, please see my previous post Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off.
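As a sketch of ridge regression with PROC REG (placeholder data set and variables; the list of ridge constants is purely illustrative), the RIDGE= option requests coefficients for a grid of shrinkage values and writes them, along with their VIFs, to the OUTEST= data set:

proc reg data=have outest=ridge_coefs outvif ridge=0 to 0.1 by 0.02;
   model y = x1 x2 x3;
run;
quit;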
Centering
This could be included under “Recoding variables,” but it is a simple fix for a common cause of collinearity, so I decided to list it separately. When fitting a model with polynomial terms (e.g., Y = X + X² + X³) or interaction effects (e.g., Y = X1 + X2 + X1*X2), you can expect collinearity between a lower-order term and the higher-order terms that contain it. This kind of collinearity can be reduced by mean-centering the continuous predictor prior to calculating the polynomial or interaction. SAS statistical procedures with an EFFECT statement, such as the LOGISTIC, GLIMMIX, and GLMSELECT procedures, can center polynomials using the EFFECT statement option STANDARDIZE(METHOD=MOMENTS)=CENTER. Centering variables for use in interactions can be done using the PROC STDIZE option METHOD=MEAN.
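Here are minimal sketches of both approaches with placeholder names. The first builds a centered cubic polynomial with the EFFECT statement in PROC GLMSELECT; the second mean-centers two variables with PROC STDIZE before an interaction is formed:

proc glmselect data=have;
   effect x_poly = polynomial(x / degree=3 standardize(method=moments)=center);
   model y = x_poly / selection=none;
run;

proc stdize data=have out=centered method=mean;
   var x1 x2;   /* the centered versions of x1 and x2 are written to the OUT= data set */
run;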
Do nothing
As mentioned previously, multicollinearity is not a violation of the assumptions of regression. Sometimes the best thing to do is just to be aware of the presence of multicollinearity and understand its consequences. High VIFs can often be ignored when standard errors are small relative to parameter estimates and predictors are significant despite the increased variance. And again, if the goal is solely prediction, the effects of collinearity are less problematic.
In my next post, I will show how to remove collinearity prior to modeling using PROC VARREDUCE and PROC VARCLUS for variable reduction.
Find more articles from SAS Global Enablement and Learning here.