According to ChatGPT (🤔) I can use the TOL and VIF options in PROC REG to assess multicollinearity.
PROC REG DATA=mydata;
MODEL dependent_variable = independent_variable1 independent_variable2 independent_variable3 / TOL VIF;
RUN;
The tolerance and VIF values in the resulting table are for each independent variable w/ the dependent variable, correct?
But that's not what I'm seeking. I'm interested in multicollinearity between independent variables, not between each independent variable and the dependent variable.
I'm trying to produce a matrix similar to the correlation matrix in PROC CORR, but with Tolerance and VIF.
Any recommendations for how to do this?
Thanks.
A predictor is collinear if it is strongly correlated with is some linear combination of the other predictors. It is not like a correlation matrix where each cell of the matrix is a correlation of just one variable with one other variable. Collinearity can be more complex by involving multiple variables. So, a matrix of values like you want is not possible. You might find the results produced by the COLLIN or COLLINOINT options more helpful. As noted in "Collinearity Diagnostics" in the Details section of the PROC REG documentation, the results include the condition numbers, which are derived from the eigenvalues of X'X. If a condition number is much larger than 10, that indicates that collinearity exists. The eigenvector values displayed next to such a large condition number describe the relation among the individual predictors. See this note that discusses collinearity in generalized linear models (the ordinary regression model is just one type) and shows an example of using the COLLIN and COLLINOINT options to detect and understand collinearity in a model.
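For reference, a minimal sketch of requesting these diagnostics. It reuses the placeholder data set and variable names from the question (mydata, dependent_variable, independent_variable1 through independent_variable3), which are stand-ins rather than real names.

```sas
/* Sketch: request collinearity diagnostics in PROC REG.           */
/* COLLIN     - diagnostics with the intercept included            */
/* COLLINOINT - diagnostics with the intercept adjusted out        */
proc reg data=mydata;
   model dependent_variable = independent_variable1
                              independent_variable2
                              independent_variable3 / collin collinoint;
run;
quit;
```

Each option produces its own Collinearity Diagnostics table with eigenvalues, condition indices, and the proportion of variation associated with each predictor.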
I can't speak for @StatDave, but I would take the second occurrence of the word "is" out of that sentence. So in my opinion it should read:
"A predictor is collinear if it is strongly correlated with some linear combination of the other predictors"
Since you have proposed the "matrix version of the indicators of collinearity", I think it necessary for you to delve a bit deeper into the essence of tolerance and variance inflation factor (VIF).
The essence I wish to talk about is: How are the two statistics computed? I think you will find it unnecessary to generate a "matrix version of the indicators of collinearity" once you have mastered the way that the two statistics are computed (at least I think so).
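Before we talk about the two statistics, let's talk about the definition of collinearity first. In short, for a given set of variables, namely $X_1, X_2, \dots, X_n$, if there exists a series of constants $c_0, c_1, \dots, c_n$ (with $c_1, \dots, c_n$ not all zero) that makes this equation true: $c_1 X_1 + c_2 X_2 + \dots + c_n X_n = c_0$, then we say that collinearity exists among $X_1, X_2, \dots, X_n$.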
Let's talk about tolerance now. How is it computed? In short, if we want to calculate the tolerance of $X_p$ (p = 1, 2, 3, ..., n), then we use $X_p$ as the dependent variable and all of the other variables (i.e. $X_1, \dots, X_{p-1}, X_{p+1}, \dots, X_n$) as the independent variables to form a multiple linear regression. We can calculate the coefficient of determination, $R_p^2$, for this regression. Tolerance is 1 minus that coefficient of determination: $\mathrm{Tolerance}_p = 1 - R_p^2$.
As for the variance inflation factor, that statistic is easy to understand once you know how tolerance is calculated. VIF is the inverse of tolerance (i.e. $\mathrm{VIF}_p = 1 / \mathrm{Tolerance}_p$).
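To make the calculation concrete, here is a rough sketch that reproduces the tolerance and VIF of one predictor by hand: regress that predictor on the other predictors, take the R-square of that auxiliary regression, and compute tolerance as 1 minus R-square and VIF as 1 over tolerance. The data set and variable names are the placeholders from the original question, and aux_est and aux_tol are hypothetical output data set names.

```sas
/* Auxiliary regression: one predictor regressed on the others.     */
/* RSQUARE adds the model R-square (_RSQ_) to the OUTEST= data set. */
proc reg data=mydata outest=aux_est rsquare noprint;
   model independent_variable1 = independent_variable2
                                 independent_variable3;
run;
quit;

data aux_tol;
   set aux_est;
   tolerance = 1 - _RSQ_;                    /* tolerance = 1 - R-square  */
   if tolerance > 0 then vif = 1 / tolerance; /* VIF = 1 / tolerance       */
   keep _DEPVAR_ _RSQ_ tolerance vif;
run;

proc print data=aux_tol noobs;
run;
```

The printed values should agree with what the TOL and VIF options report for independent_variable1 in the full model. Note that the dependent variable of the full model never enters this calculation.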
So now you can see, for each independent variable, we can generate a pair of statistics that measure its collinearity. Since the tolerance as well as VIF of each and every independent variable take both the information of that variable and other independent variables in the model into account simultaneously, there is no need for you to compute a matrix that specifically tells you the collinearity between one independent variable and another one.
In addition to tolerance and VIF, the condition index is another indicator of collinearity. Its computation is more complex than that of tolerance and VIF, since it is based on the eigenvalues of X'X, so you need some knowledge of matrix algebra to understand it. From a practical perspective, a condition index larger than 10 suggests collinearity, while a condition index larger than 30 suggests severe collinearity. You can let SAS compute the condition index by using the method demonstrated by @StatDave. In practice, SAS generates a table (two tables if you also specify the COLLINOINT option) and I usually read the table(s) from the bottom up, since the largest condition index is at the bottom. If the largest condition index is smaller than 10, the result suggests that no collinearity exists.
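As a rough sketch of that computation, following the description in the PROC REG documentation: each condition index is the square root of the ratio of the largest eigenvalue to an individual eigenvalue. The symbols below are my own notation, with $\lambda_1, \dots, \lambda_k$ standing for the eigenvalues of the scaled X'X matrix and $\eta_i$ for the i-th condition index.

```latex
\eta_i = \sqrt{\frac{\lambda_{\max}}{\lambda_i}}, \qquad i = 1, \dots, k
```

The smallest eigenvalue gives the largest condition index, which is why the value to check sits in the last row of the table.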
Collinearity is an aspect of statistical diagnostics (or regression diagnostics). You can refer to a book on multivariate statistics or a monograph on statistical diagnostics if you want more information. But please be aware of the fact that many statistical methods (like the computation of condition index) are based on matrix algebra, so it may be necessary in some circumstances that you master (at least know something about) matrix algebra before you study statistics.
The tolerance and VIF values in the resulting table are for each independent variable w/ the dependent variable, correct?
Incorrect.
VIF is independent of Y. It is a measure of how much each independent variable is affected by multi-collinearity — specifically how much the variance of the estimate of the coefficient of that factor is inflated by multi-collinearity.
Same for tolerance. 1/Tolerance=VIF
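A small sketch for checking the relationship yourself: capture the parameter estimates table with ODS OUTPUT and multiply the two columns. The column names Tolerance and VarianceInflation are what I would expect in the ParameterEstimates table when TOL and VIF are requested, but treat them as an assumption and confirm with PROC CONTENTS. pe and vif_check are hypothetical data set names, and the model uses the placeholder names from the question.

```sas
/* Capture the parameter estimates, including TOL and VIF columns. */
ods output ParameterEstimates=pe;
proc reg data=mydata;
   model dependent_variable = independent_variable1
                              independent_variable2
                              independent_variable3 / tol vif;
run;
quit;

data vif_check;
   set pe;
   where upcase(Variable) ne 'INTERCEPT';  /* the intercept has no meaningful VIF */
   check = Tolerance * VarianceInflation;  /* should be 1, up to rounding         */
run;

proc print data=vif_check noobs;
   var Variable Tolerance VarianceInflation check;
run;
```

If you rerun the model with a different dependent variable and the same predictors, the Tolerance and VarianceInflation columns should not change.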
@PaigeMiller @StatDave Thanks for your help here!
So the values resulting from the TOL and VIF options on the MODEL statement in PROC REG are reported for each factor, but each one reflects that factor's collinearity with all of the other independent variables in the model combined, correct?
According to this textbook, “In general, small tolerance values, including those below 0.25, are worrisome, and those below 0.10 are serious.” And, “Variance inflation factors greater than 2.5 may be problematic, whereas values greater than ten are serious”
Given the above, and my objective of wanting to diagnose multicollinearity that needs to be addressed, do the COLLIN or COLLINOINT options, and the condition numbers, offer additional important information that I should be considering?