_maldini_
Barite | Level 11

According to ChatGPT (🤔) I can use the TOL and VIF options in PROC REG to assess multicollinearity.

PROC REG DATA=mydata;
MODEL dependent_variable = independent_variable1 independent_variable2 independent_variable3 / TOL VIF;
RUN;

The tolerance and VIF values in the resulting table are for each independent variable w/ the dependent variable, correct?

[Image: Screenshot 2023-03-09 at 11.33.35 AM.png]

But that's not what I'm seeking. I'm interested in multicollinearity between independent variables, not between each independent variable and the dependent variable.

 

I'm trying to produce a matrix similar to the correlation matrix in PROC CORR, but with Tolerance and VIF.

Any recommendations for how to do this?

 

Thanks.

Accepted Solutions
StatDave
SAS Super FREQ

A predictor is collinear if it is strongly correlated with is some linear combination of the other predictors. It is not like a correlation matrix where each cell of the matrix is a correlation of just one variable with one other variable. Collinearity can be more complex by involving multiple variables. So, a matrix of values like you want is not possible. You might find the results produced by the COLLIN or COLLINOINT options more helpful. As noted in "Collinearity Diagnostics" in the Details section of the PROC REG documentation, the results include the condition numbers, which are derived from the eigenvalues of X'X. If a condition number is much larger than 10, that indicates that collinearity exists. The eigenvector values displayed next to such a large condition number describe the relation among the individual predictors. See this note that discusses collinearity in generalized linear models (the ordinary regression model is just one type) and shows an example of using the COLLIN and COLLINOINT options to detect and understand collinearity in a model.
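As a minimal sketch of what requesting these diagnostics could look like, the original poster's MODEL statement can be extended with the COLLIN and COLLINOINT options. The data set and variable names are the placeholders from the question, not real names.

```sas
/* Sketch only: mydata and the variable names are the placeholders */
/* used in the original question.                                   */
proc reg data=mydata;
   model dependent_variable = independent_variable1
                              independent_variable2
                              independent_variable3
         / tol vif       /* per-predictor tolerance and VIF        */
           collin        /* diagnostics including the intercept    */
           collinoint;   /* diagnostics with the intercept removed */
run;
quit;
```

Requesting both options produces two collinearity diagnostics tables, one with the intercept included and one with it adjusted out.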

_maldini_
Barite | Level 11
Is there supposed to be an "or" here: "A predictor is collinear if it is strongly correlated with, OR is some linear combination of the other predictors."
PaigeMiller
Diamond | Level 26

I can't speak for @StatDave but I would take the second occurrence of the word "is" out of that sentence. So in my opinion it should read

 

"A predictor is collinear if it is strongly correlated with some linear combination of the other predictors"

--
Paige Miller
Season
Lapis Lazuli | Level 10

Since you have proposed the "matrix version of the indicators of collinearity", I think it necessary for you to delve a bit deeper into the essence of tolerance and variance inflation factor (VIF).

The essence I wish to talk about is: How are the two statistics computed? I think you will find it unnecessary to generate a "matrix version of the indicators of collinearity" once you have mastered the way that the two statistics are computed (at least I think so).

Before we talk about the two statistics, let's talk about the definition of collinearity first. In short, for a given set of variables, namely $X_1, X_2, \dots, X_n$, if there exists a series of constants $c_0, c_1, \dots, c_n$ (with $c_1, \dots, c_n$ not all zero) that makes this equation true: $c_1 X_1 + c_2 X_2 + \dots + c_n X_n = c_0$, then we say that collinearity exists among $X_1, X_2, \dots, X_n$.

Let's talk about tolerance now. How is it computed? In short, if we want to calculate the tolerance of $X_p$ (p = 1, 2, 3, ..., n), then we use $X_p$ as the dependent variable and all of the other variables (i.e. $X_1, \dots, X_{p-1}, X_{p+1}, \dots, X_n$) as the independent variables to form a multiple linear regression. We can calculate the coefficient of determination $R_p^2$ (the R-square) for this linear regression. Tolerance is 1 minus that coefficient of determination, i.e. $1 - R_p^2$.

As for the variance inflation factor (VIF), that statistic is easy to understand once you know how tolerance is calculated. VIF is the reciprocal of tolerance (i.e. $\mathrm{VIF} = 1 / \mathrm{Tolerance}$).
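The calculation can be reproduced by hand for one predictor. The sketch below regresses one predictor on the others and takes tolerance as 1 minus the R-square of that auxiliary regression (the standard definition). It assumes the placeholder names from the original question, and the output data set names `aux` and `tol_vif` are made up for illustration.

```sas
/* Auxiliary regression: one predictor against the other predictors. */
/* RSQUARE adds the _RSQ_ variable to the OUTEST= data set.          */
proc reg data=mydata outest=aux rsquare noprint;
   model independent_variable1 = independent_variable2
                                 independent_variable3;
run;
quit;

/* Tolerance = 1 - R-square; VIF = 1 / tolerance.                    */
data tol_vif;
   set aux;
   tolerance = 1 - _RSQ_;
   /* Guard against division by zero under perfect collinearity.     */
   if tolerance > 0 then vif = 1 / tolerance;
   keep _DEPVAR_ _RSQ_ tolerance vif;
run;

proc print data=tol_vif noobs;
run;
```

If the same observations are used, the result should match the TOL and VIF values that PROC REG reports for independent_variable1 in the full model. Repeating this with each predictor in turn as the response gives the remaining values.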

So now you can see that, for each independent variable, we can generate a pair of statistics that measure its collinearity. Since the tolerance and VIF of each independent variable take into account the information of both that variable and all of the other independent variables in the model simultaneously, there is no need for you to compute a matrix that specifically tells you the collinearity between one independent variable and another one.

In addition to tolerance and VIF, the condition index is another indicator of collinearity. The computation of the condition index is more complex than that of tolerance and VIF, since it is based on the eigenvalues of the X'X matrix. You need some knowledge of matrix algebra before you can understand the computation. From a practical perspective, a condition index larger than 10 suggests collinearity, while a condition index larger than 30 suggests severe collinearity. You can let SAS compute the condition index by using the method demonstrated by @StatDave. In practice, SAS generates a table (two tables if you also specify the COLLINOINT option) and I usually read the table(s) from the bottom to the top, since the largest condition index is at the bottom of the table(s). If the largest condition index is smaller than 10, then the result suggests that no collinearity exists.
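For readers who want the formula behind these rules of thumb, here is a sketch of the usual definition, as I understand it from the "Collinearity Diagnostics" section of the PROC REG documentation. The symbols $\eta_j$, $\lambda_j$ and $k$ are notation chosen here, not taken from the thread. Let $\lambda_1, \dots, \lambda_k$ be the eigenvalues of the X'X matrix after its columns are scaled so that the diagonal is 1 (COLLIN keeps the intercept column, COLLINOINT adjusts it out first). Then each condition index is the square root of the ratio of the largest eigenvalue to that eigenvalue:

```latex
\eta_j = \sqrt{\frac{\lambda_{\max}}{\lambda_j}}, \qquad j = 1, \dots, k
```

The smallest eigenvalue therefore gives the largest condition index, which is why the value to check first sits in the last row of the table.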

Collinearity is an aspect of statistical diagnostics (or regression diagnostics). You can refer to a book on multivariate statistics or a monograph on statistical diagnostics if you want more information. But please be aware of the fact that many statistical methods (like the computation of condition index) are based on matrix algebra, so it may be necessary in some circumstances that you master (at least know something about) matrix algebra before you study statistics.

PaigeMiller
Diamond | Level 26

The tolerance and VIF values in the resulting table are for each independent variable w/ the dependent variable, correct?

 

Incorrect.

 

VIF is independent of Y. It is a measure of how much each independent variable is affected by multi-collinearity — specifically how much the variance of the estimate of the coefficient of that factor is inflated by multi-collinearity.

 

Same for tolerance. 1/Tolerance=VIF
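One way to check this yourself is to fit two models with the same predictors but different responses and compare the output. This is a sketch under assumptions: it uses the SASHELP.CARS sample data set rather than the poster's data, and the labels `price` and `mileage` are made up for illustration.

```sas
/* Same predictors, two different responses.                        */
/* Assumes SASHELP.CARS is installed and that both models use the   */
/* same observations (no missing values in these five variables).   */
proc reg data=sashelp.cars;
   price:   model MSRP     = EngineSize Horsepower Weight / tol vif;
   mileage: model MPG_City = EngineSize Horsepower Weight / tol vif;
run;
quit;
```

The parameter estimates differ between the two models, but the Tolerance and Variance Inflation columns should be identical, because they depend only on the predictors.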

--
Paige Miller
_maldini_
Barite | Level 11

@PaigeMiller @StatDave Thanks for your help here!

 

So the values resulting from the TOL and VIF options on the MODEL statement in PROC REG are for each factor, but are representative of the combined multicollinearity of all the independent variables in the model, correct?

 

According to this textbook, “In general, small tolerance values, including those below 0.25, are worrisome, and those below 0.10 are serious.” And, “Variance inflation factors greater than 2.5 may be problematic, whereas values greater than ten are serious”

 

Given the above, and my objective of wanting to diagnose multicollinearity that needs to be addressed, do the COLLIN or COLLINOINT options, and the condition numbers, offer additional important information that I should be considering?

StatDave
SAS Super FREQ
Yes, they do provide useful additional information. As I noted earlier, the eigenvector values associated with any large condition number indicating collinearity tell you something about which variables are collinear.
_maldini_
Barite | Level 11
Thank you!
