@Season wrote:
Thank you for your kind and repeated reminders. In fact, I had just begun reading the note you mentioned when I was replying to @PaigeMiller yesterday. I now understand that the weights must be applied when diagnosing collinearity in generalized linear models.
Still, I have some questions:
(1) I noticed that the VAR statement of PROC STANDARD standardizes all of the independent variables in the logistic model (li, temp and cell). Given that collinearity exists only between the variable temp and the intercept, do all of the independent variables have to be standardized?
The link talks about a case where one of the predictor variables is highly correlated with the Intercept. It says: "The variation proportions associated with this large condition index suggest that TEMP is collinear with the intercept." and it also says "By rescaling the predictors, the collinearity with the intercept can be removed."
So standardizing the variables will remove the effect of high correlations if there are only large correlations with the intercept. If there are high correlations between one predictor variable (not the intercept) and another predictor variable (not the intercept), then standardizing the variables does not help remove this high correlation. (So the link is correct for the specific case of correlation with the intercept; it does not apply in general to high correlations between predictor variables.)
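A minimal sketch of this point, using SASHELP.CLASS as a stand-in data set and an ordinary (unweighted) linear model rather than the weighted logistic case from the note. The COLLIN option keeps the intercept in the diagnostics, while COLLINOINT adjusts it out. The data set name class_std and the choice of variables are assumptions made only for illustration.

```sas
/* Diagnostics on the raw predictors:
   COLLIN includes the intercept, COLLINOINT adjusts it out */
proc reg data=sashelp.class;
   model weight = height age / collin collinoint;
run;
quit;

/* Center and scale the predictors: mean 0, standard deviation 1 */
proc standard data=sashelp.class mean=0 std=1 out=class_std;
   var height age;
run;

/* Same diagnostics on the standardized predictors */
proc reg data=class_std;
   model weight = height age / collin collinoint;
run;
quit;
```

If the argument above holds, the condition indices in the COLLIN table should drop after centering, while the COLLINOINT table, which reflects only the relationship between height and age, should not change.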
Thank you, Paige, for your reply! Another issue that concerns me is the nomenclature of the statistics generated in the process we have been focusing on over the past few days.
Another user of the community raised a question about the computation of GVIF in a logistic regression model. I wonder if the VIF computed from the weighted information matrix can still be termed "VIF". Should a VIF calculated from the weighted information matrix in generalized linear models be termed "GVIF"?
I don't think that standardization is a general solution to collinearity and its effect on the model.
I took another look at that cancer remission example in my note on penalty-based selection methods (LASSO, ridging, and elastic net). See the introductory text and, in particular, Example 3, which shows how the problem found in the first note on collinearity can be addressed using penalty-based selection. The LASSO method is shown using NLMIXED, which shows how the penalty is applied to the binomial likelihood when fitting the model, and with HPGENSELECT, which simplifies its use. It also shows how the dual-penalty (LASSO+ridging) elastic net method can be applied using NLMIXED. The point is that collinearity, whether involving the intercept or other predictors, can be avoided by using these shrinkage methods to select effects to stay in the model.
Thank you for your reply! With the help of you and @PaigeMiller, I think I have grasped the workflow for dealing with collinearity in generalized linear models. Thank you both for your patience, kindness and expertise!
I have kept asking about how to deal with collinearity in the context of the "ordinary" logistic regression model. The reasons are as follows: (1) I hardly know anything about ridge regression and LASSO other than their names and the most basic facts about them (they do not use unbiased estimation, and they are suitable for dealing with collinearity). (2) It has been reported that a larger sample is required for modern penalization methods like LASSO and elastic net when building clinical prediction models, but the author of the paper I cited simply stated that "further research on sample size requirement on methods like LASSO is required", possibly implying that no conclusion has been reached on that issue. In this case, sample size and power calculation may be a tough issue, at least for researchers building a logistic regression model as a clinical prediction model.
Hello, I ran into problems today when dealing with collinearity in logistic regression models using your note.
(1) Should diagnosis and treatment of collinearity precede or follow the variable selection process? I had not thought of this issue before analyzing my own data. When collinearity is found, the researcher's dilemma is whether to trust the statistical results obtained before the collinearity diagnostics were run. To be explicit, collinearity can distort results, leading to "fake" statistics (e.g. the P-values of the independent variables and the estimates of the regression coefficients (β)). In this case, the researcher cannot tell whether an independent variable is nonsignificant because of collinearity or because it truly is nonsignificant. Excluding independent variables via model selection methods may then cause truly significant variables to be left out of the collinearity diagnostics and the final model building process.
(2) I tried my own data today and found that it perfectly matched the circumstance you demonstrated in your note. Collinearity between four independent variables and the intercept was found, while no collinearity was found among the independent variables. No collinearity was found when the intercept was excluded via the COLLINOINT option in PROC REG. However, collinearity persisted after variable standardization. The very four variables that proved to be collinear with the intercept were still collinear with it after all of the independent variables were standardized. What is more, collinearity still disappeared after excluding the intercept, indicating that the four variables were still collinear with the intercept even after standardization.
What should I do now?
@Season wrote:
OK, I see. Computing VIF in PROC REG when the dependent variable is a continuous one is easy. Yet the question I raised earlier is the computation of VIF in a logistic regression model. Can SAS do that? Thanks!
No you don't see. I said: "To use the VIF in PROC REG, you create a made up variable that is a continuous Y and use your X-variables. The VIF does not depend on the Y variable."
And of course, there's also the point made by @StatDave
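A minimal sketch of the "made-up Y" idea quoted above, using SASHELP.CARS as a stand-in. The data set name cars_fake, the variable fake_y, and the seed are hypothetical choices for illustration. This is the plain, unweighted VIF; it does not include the weighting discussed in the note cited by @StatDave.

```sas
/* Add a made-up continuous response; its values do not affect the VIF */
data cars_fake;
   set sashelp.cars;
   if _n_ = 1 then call streaminit(12345);   /* arbitrary seed */
   fake_y = rand('normal');
run;

/* VIF and tolerance depend only on the X-variables */
proc reg data=cars_fake;
   model fake_y = mpg_city mpg_highway horsepower weight / vif tol;
run;
quit;
```

Changing the seed, or replacing fake_y with any other non-constant numeric variable, should leave the VIF column unchanged.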
Oh, yes, you were totally correct (laugh oh laugh), I surely did not see yesterday.
@PaigeMiller wrote:
@Season wrote:
OK, I see. Computing VIF in PROC REG when the dependent variable is a continuous one is easy. Yet the question I raised earlier is the computation of VIF in a logistic regression model. Can SAS do that? Thanks!
No you don't see. I said: "To use the VIF in PROC REG, you create a made up variable that is a continuous Y and use your X-variables. The VIF does not depend on the Y variable."
I did not completely understand what you meant yesterday and was merely focusing on the word "continuous" before the letter "Y". I thought you must have misunderstood me, since I had made a lengthy reply yesterday with my questions hidden between the lines. As a result, you might have just skimmed through my reply without noticing that I was modeling a discrete dependent variable. That is the reason for the further reply you quoted.
And of course, there's also the point made by @StatDave
Now I think I have truly seen what you meant. But other questions emerged when I was reading the note @StatDave cited. I have already raised them in my latest reply to @StatDave. I wonder if you could offer your suggestions on the three questions, if you don't mind.
Thank you very much for your patience and your time spent on my questions again!
Hello, Paige. I tried to use the note provided by @StatDave to analyze my data today and encountered some problems.
(1) Should diagnosis and treatment of collinearity precede or follow the variable selection process? I had not thought of this issue before analyzing my own data. When collinearity is found, the researcher's dilemma is whether to trust the statistical results obtained before the collinearity diagnostics were run. To be explicit, collinearity can distort results, leading to "fake" statistics (e.g. the P-values of the independent variables and the estimates of the regression coefficients (β)). In this case, the researcher cannot tell whether an independent variable is nonsignificant because of collinearity or because it truly is nonsignificant. Excluding independent variables via model selection methods may then cause truly significant variables to be left out of the collinearity diagnostics and the final model building process.
(2) I tried my own data today and found that it perfectly matched the circumstance @StatDave demonstrated in the note he/she cited. Collinearity between four independent variables and the intercept was found, while no collinearity was found among the independent variables. No collinearity was found when the intercept was excluded via the COLLINOINT option in PROC REG. However, collinearity persisted after variable standardization. The very four variables that proved to be collinear with the intercept were still collinear with it after all of the independent variables were standardized. What is more, collinearity still disappeared after excluding the intercept, indicating that the four variables were still collinear with the intercept even after standardization. What should I do now?
Thank you for your suggestion!
What you describe could be the result of representing a categorical predictor, that has k levels, with k dummy variables in the MODEL statement in PROC REG. You should instead represent it with k-1 variables, equal to the number of its degrees of freedom. Unfortunately, PROC REG does not have a CLASS statement that would make this easier. But there are several procedures you can use to create a data set that expands categorical predictors into an appropriate set of dummy variables. In one of those procedures you should use the PARAM=REF option to create the k-1 dummy variables for this situation.
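One possible way to build such a design data set, sketched with SASHELP.HEART as a stand-in: PROC LOGISTIC with the OUTDESIGN= and OUTDESIGNONLY options and PARAM=REF on the CLASS statement. The output name heart_design and the choice of variables are assumptions for illustration; the names of the generated dummy columns are chosen by the procedure, so it is worth listing them before using them in PROC REG.

```sas
/* Write only the design matrix; OUTDESIGNONLY skips the model fit */
proc logistic data=sashelp.heart outdesign=heart_design outdesignonly;
   class sex smoking_status / param=ref;   /* k-1 dummy columns per k-level predictor */
   model status(event='Dead') = sex smoking_status cholesterol;
run;

/* List the generated column names before passing them to PROC REG */
proc contents data=heart_design varnum;
run;
```

With reference coding, a five-level predictor such as Smoking_Status should appear as four columns and a two-level predictor such as Sex as one.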
Thank you for your reply!
First of all, I would like to answer my own question that was raised a few days ago (so that other users of the Community can save time looking up information): diagnosis of collinearity should precede the variable selection process. I looked up a book on multivariate statistics and referred to the part on collinearity (in multiple linear regression). It said (original text not in English): "Multicollinearity (an alias of collinearity) is the distortion of the model estimation or inability of accurate estimation caused by the precise correlation of or the fact that strong correlation exists among the covariates (can be interpreted as "independent variables" in this setting) in the linear regression model. Therefore, prior to regression, knowing the relationship among the covariates is of great importance". The text I translated clearly points out that diagnosis of collinearity should precede the variable selection process. I also consulted one of my teachers, who teaches us SAS. She also stated that diagnosis of collinearity should precede the variable selection process.
As for the reason behind my failure to reduce collinearity, I searched for an answer myself after raising my question. You mentioned that an inappropriate setup of dummy variables may be one of the underlying causes of the problem. I am grateful to you for pointing out that issue (so I will never make such a mistake in my future data analysis), but unfortunately this was not the case in my problem.
I read the note you had mentioned again and noticed that the underlying cause of collinearity between an independent variable and the intercept is the disproportionately small standard deviation of the variable that exhibits collinearity with the intercept. I reviewed my model and found that I had put one continuous independent variable alongside dummy variables in the model (I did so because, compared with other variables, the range of the continuous variable is relatively small, so in order to retain more information from my data, I put the continuous variable directly in the model) and that the means of the other independent variables (dummy variables) are close to 0.5, with their standard deviations ranging from 0.5 to 0.6. However, the continuous variable had a mean of around 6 and a standard deviation of around 1. The standard deviation of the continuous variable was smaller than its mean, while the standard deviations of the discrete (dummy) variables were larger than their means. In other words, compared with the other independent variables, the standard deviation of the continuous variable was disproportionately small.
At first, I tackled the problem by standardizing all the variables so that each had a standard deviation of 1 (just as in the note you referred to). However, the largest condition index computed from the weighted information matrix was 11 before variable standardization (the second largest was 8, so there is no need to worry about that one); the largest condition index computed from the weighted information matrix was 12 after variable standardization, with the very same variable still exhibiting collinearity with the intercept and no collinearity observed when the intercept was removed from the analysis. In other words, using 1 as the standard deviation in PROC STANDARD did not help to reduce collinearity at all.
I decided to try several different standard deviations in PROC STANDARD. First, I tried 0.5, producing even more severe collinearity (the largest condition index computed from the weighted information matrix was 29). Then, I tried 3. This time, collinearity disappeared.
So, in conclusion, choosing the right standard deviation in PROC STANDARD is of vital importance in dealing with collinearity with the variable standardization method. One should not choose the standard deviation in PROC STANDARD arbitrarily, i.e. without observing the exact number of the statistics.
If you are talking about correlation between pairs of the independent variables, then standardizing will not make this correlation disappear. If you are talking about correlation between one independent variable and the intercept, then standardizing will have some benefit.
For example, suppose your data has Temperature_F and Temperature_C. These are perfectly correlated (correlation=1), and no amount of standardizing will make that go away. If there is some measurement noise, so the correlation is (for example) 0.95, no amount of standardizing will make that correlation go away.
Using two variables which are highly correlated (correlation=0.94102) from data set SASHELP.CARS, please take a look:
proc corr data=sashelp.cars;
var mpg_city mpg_highway;
run;
proc reg data=sashelp.cars;
model msrp=mpg_city mpg_highway/vif;
run;
/* Convert variables to mean of zero and standard deviation of 1 */
proc stdize data=sashelp.cars out=cars method=std;
var mpg_city mpg_highway;
run;
/* Determine correlations and VIF on standardized variables, you get the exact same correlation and VIF as before */
proc corr data=cars;
var mpg_city mpg_highway;
run;
proc reg data=cars;
model msrp=mpg_city mpg_highway/vif;
run;
/* Multiply by 10, so the standard deviation is now 10 */
data cars;
set cars;
mpg_city=mpg_city*10;
mpg_highway=mpg_highway*10;
run;
/* Determine correlations and VIF on standardized variables, you get the exact same correlation and VIF as before */
proc corr data=cars;
var mpg_city mpg_highway;
run;
proc reg data=cars;
model msrp=mpg_city mpg_highway/vif;
run;
As you can see, standardizing so that each variable had a standard deviation of 1 did not make correlation go away. It had zero impact on the correlation. Standardizing so that each variable had a standard deviation of 10 did not make correlation go away. It had zero impact on the correlation.
As you can see, standardizing so that each variable had a standard deviation of 1 did not make VIF any smaller. It had zero impact on the VIF. Standardizing so that each variable had a standard deviation of 10 did not make VIF any smaller. It had zero impact on the VIF.
When you have correlation between independent variables, I can think of three ways to proceed:
As I stated earlier, my preference is to use Partial Least Squares, which has been used successfully in (probably) thousands of published articles. An article by Tobias (of SAS Institute) illustrates a situation where there are 1000 highly correlated variables, and a useful model is developed without any variable selection step, this model being "robust" against the effects of multi-collinearity. The logistic version of Partial Least Squares also works very well, and is available in an R package (not available in SAS, unless you write your own macro to do this).
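For a continuous response, a Partial Least Squares fit in SAS might look like the following minimal sketch. It uses SASHELP.CARS and an arbitrary set of correlated predictors as assumptions for illustration; it is not the logistic version, which, as noted above, is not available in SAS.

```sas
/* PLS regression on correlated predictors;
   CV=ONE requests leave-one-out cross validation
   to help choose the number of factors */
proc pls data=sashelp.cars method=pls cv=one;
   model msrp = mpg_city mpg_highway horsepower weight enginesize;
run;
```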
Thank you, Paige, for giving me your suggestion as well as a detailed explanation once again. Your information has benefited me a lot. But I am afraid that you were not fully aware of the specific question I raised. I talked about correlation among variables when I was translating the definition of collinearity in a reply to @StatDave. That reply did not contain any question at all, but was instead a record of how I solved, on my own over the past few days, the problem I had raised. I wrote the passage to remind other readers of the thread of the details of tackling collinearity in logistic regression models.
The question I asked was written in another reply in which I @ you. In brief, I found some of the interaction terms of my logistic regression model to be statistically and professionally significant. So I am wondering about the exact way of dealing with collinearity when interaction terms are included. Is there anything different from the way mentioned in the note provided by @StatDave? Please refer to the reply in which I @ you for details.
Finally, I would like to ask about the sample size requirement of Logistic Partial Least Squares Regression. I have come to know that LASSO requires a larger sample size, yet there was no conclusion on a specific sample size formula, at least in the case in which a logistic regression model is used as a clinical prediction model. What about Logistic Partial Least Squares Regression? Is there a specific calculation formula?
Many thanks!
@Season wrote:
Thank you, Paige, for giving me your suggestion as well as a detailed explanation once again. Your information has benefited me a lot. But I am afraid that you were not fully aware of the specific question I raised. I talked about correlation among variables when I was translating the definition of collinearity in a reply to @StatDave. That reply did not contain any question at all, but was instead a record of how I solved, on my own over the past few days, the problem I had raised. I wrote the passage to remind other readers of the thread of the details of tackling collinearity in logistic regression models.
I wrote that code and explanation because somewhere along the way in this very long thread, I believe you have had a misunderstanding and have latched onto standardizing as a way to reduce the correlation (or multi-collinearity) in the data. Standardizing does not do that (except in the case of a variable correlated with the intercept).
The question I asked was written in another reply in which I @ you. In brief, I found some of the interaction terms of my logistic regression model to be statistically and professionally significant. So I am wondering about the exact way of dealing with collinearity when interaction terms are included. Is there anything different from the way mentioned in the note provided by @StatDave? Please refer to the reply in which I @ you for details.
No different than if you have no interaction terms. The interaction term is simply another variable in the model.
Finally, I would like to ask about the sample size requirement of Logistic Partial Least Squares Regression. I have come to know that LASSO requires a larger sample size, yet there was no conclusion on a specific sample size formula, at least in the case in which a logistic regression model is used as a clinical prediction model. What about Logistic Partial Least Squares Regression? Is there a specific calculation formula?
I am not aware of a sample size requirement for PLS. Obviously, as with any statistical procedure, the more data you have, the better chance you have of finding a signal. I have seen PLS work well with fewer than 100 data points, but of course, there can't be a lot of noise, there must be a relatively strong signal in the data. And this paragraph would be true not just for PLS, but also for the Lasso, and Principal Components and any other model fitting method.
The formula for Logistic PLS is in the paper I linked to.
Thank you for your help again!😀
@PaigeMiller wrote:
No different than if you have no interaction terms. The interaction term is simply another variable in the model.
But there are several details I would like to ask about.
I found that several interaction terms were statistically (and of course professionally) significant in my logistic regression model. The interaction terms were statistically significant whether I used the unstandardized or the standardized variables as independent variables in the model.
I tried the Hessian weight computation in PROC GENMOD demonstrated in the note you provided and found that the Hessian weights generated with the interaction terms included differed from those generated without them. However, both results showed that collinearity exists between several independent variables and the intercept, while no collinearity was observed among the independent variables themselves.
In addition, I tried to use PROC REG to generate the tolerance, VIF and condition index and found that PROC REG does not accept interaction terms in the MODEL statement, whether written as "a|b" or "a b a*b". Therefore, although the Hessian weights were computed with the interaction terms in the model, there was no way to take the interaction terms into account when computing the collinearity indicators (i.e. tolerance, VIF and condition index).
To turn my lengthy words into a simpler question: please help me pick the correct one of the two pieces of code I wrote below.
Suppose (1) we have a, b and c as the independent variables and y as the dependent variable; (2) among the interaction terms, a*c is statistically significant; (3) collinearity exists only between one or more independent variable(s) and the intercept. Here are the two pieces of code for dealing with collinearity.
/*Code 1*/
ods select none;
proc genmod data=log;
model y(event='1')=a|c b/dist=binomial corrb itprint scoring=50;/*Computation of Hessian weight, with interaction term included*/
output out=col hesswgt=w;
run;
ods select collindiag collindiagnoint ParameterEstimates;
proc reg data=col;
weight w;
model y=a b c/collin collinoint vif tol;/*PROC REG does not support putting interaction terms in the model*/
run;
quit;
ods select collindiag collindiagnoint ParameterEstimates;
proc reg data=col;
model y=a b c/collin collinoint vif tol;
run;
quit;
/*Correction of collinearity (assuming that only collinearities among independent variables and the intercept exist)*/
proc standard data=col s=0.5 out=std;
var a b c;/*PROC STANDARD does not support putting interaction terms in the var argument*/
run;
/*Check if collinearity were successfully eliminated via the variable standardization process*/
ods select ParameterEstimates ;
proc genmod data=std;
model y(event='1')=a|c b/dist=binomial corrb itprint scoring=50;
output out=col1 hesswgt=w;
run;
ods select collindiag collindiagnoint ParameterEstimates;
proc reg data=col1;
weight w;
model y=a b c/collin collinoint vif tol;
run;
quit;
ods select collindiag collindiagnoint ParameterEstimates;
proc reg data=col1;
model y=a b c/collin collinoint vif tol;
run;
quit;
/*Logistic regression model building*/
proc logistic data=std;
model y(event='1')=a|c b/parmlabel lackfit aggregate scale=pearson covb pcorr corrb ctable stb rsq;
roc;
run;
/*Code 2*/
ods select none;
proc genmod data=log;
model y(event='1')=a c b/dist=binomial corrb itprint scoring=50;/*Computation of Hessian weight, without including interaction term, resulting a result different from the corresponding process in Code 1*/
output out=col hesswgt=w;
run;
ods select collindiag collindiagnoint ParameterEstimates;
proc reg data=col;
weight w;
model y=a b c/collin collinoint vif tol;/*PROC REG does not support putting interaction terms in the model*/
run;
quit;
ods select collindiag collindiagnoint ParameterEstimates;
proc reg data=col;
model y=a b c/collin collinoint vif tol;
run;
quit;
/*Correction of collinearity (assuming that only collinearities among independent variables and the intercept exist)*/
proc standard data=col s=0.5 out=std;
var a b c;/*PROC STANDARD does not support putting interaction terms in the var argument*/
run;
/*Check if collinearity were successfully eliminated via the variable standardization process*/
ods select ParameterEstimates ;
proc genmod data=std;
model y(event='1')=a c b/dist=binomial corrb itprint scoring=50;/*Whether or not including the interaction term here may have no impact on the Hessian weight, as results generated from my own data show*/
output out=col1 hesswgt=w;
run;
ods select collindiag collindiagnoint ParameterEstimates;
proc reg data=col1;
weight w;
model y=a b c/collin collinoint vif tol;
run;
quit;
ods select collindiag collindiagnoint ParameterEstimates;
proc reg data=col1;
model y=a b c/collin collinoint vif tol;
run;
quit;
/*Logistic regression model building*/
proc logistic data=std;
model y(event='1')=a|c b/parmlabel lackfit aggregate scale=pearson covb pcorr corrb ctable stb rsq;
roc;
run;
Which code is the correct one?
Many thanks!
You can use PROC GLMMOD to create a design matrix that contains interaction (and other) terms, which can then be used in PROC REG to compute VIFs and other collinearity diagnostics for a model with interactions.
Example: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_glmmod_examples01.htm
The OUTDESIGN= matrix output from PROC GLMMOD has columns for the interaction which can be used in PROC REG.
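A minimal sketch of that approach, using SASHELP.CARS and an arbitrary interaction as stand-ins (the data set names cars_design and cars_parm are hypothetical). It is unweighted; for the logistic case discussed in this thread, the Hessian weight variable would also have to be carried along and used in a WEIGHT statement in PROC REG.

```sas
/* Expand the model, including the interaction, into design columns COL1, COL2, ... */
proc glmmod data=sashelp.cars outdesign=cars_design outparm=cars_parm noprint;
   model msrp = mpg_city|mpg_highway horsepower;
run;

/* OUTPARM= maps each COLn column to the effect it represents */
proc print data=cars_parm;
run;

/* COL1 is the intercept column, so the predictors start at COL2 */
proc reg data=cars_design;
   model msrp = col2-col5 / vif tol collin collinoint;
run;
quit;
```

Here COL2 through COL5 should correspond to mpg_city, mpg_highway, their product, and horsepower; the printed OUTPARM= data set is the place to confirm that mapping.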