a month ago
I am using PROC LOGISTIC, and two of my input variables, var1 and var2, are highly correlated.
Both variables are continuous, and the collinearity is affecting my model.
I want to know whether using ranges can be a solution: I would convert var1 and var2 to discrete variables using ranges,
and then use them in the model as classification variables.
Is this a good way to reduce collinearity in my model?
Any advice or alternative solution would be greatly appreciated.
a month ago - last edited a month ago
Using ranges of continuous variables is rarely a good solution to any problem, in my opinion. And I don't see how using ranges eliminates the problem of multicollinearity. It's still there; you are just masking it by creating ranges, and you create other problems (such as throwing away information within each range) in the process.
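To illustrate why binning only masks the association, here is a minimal sketch in Python (not SAS) on simulated data. Everything in it is an assumption made for illustration: the sample size, the noise level of 0.3, the choice of quartile bins, and the helper names `quartile_bin` and `pearson`. It is not a reproduction of the original poster's data.

```python
import random
import statistics

def quartile_bin(values):
    """Assign each value to a bin 0..3 using the sample quartiles as cut points."""
    cuts = statistics.quantiles(values, n=4)  # three cut points
    return [sum(v > c for c in cuts) for v in values]

def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Simulated, hypothetical data: var2 is var1 plus a small amount of noise.
rng = random.Random(0)
var1 = [rng.gauss(0, 1) for _ in range(2000)]
var2 = [v + 0.3 * rng.gauss(0, 1) for v in var1]

bin1 = quartile_bin(var1)
bin2 = quartile_bin(var2)

r_raw = pearson(var1, var2)  # correlation of the continuous variables
same_bin = sum(a == b for a, b in zip(bin1, bin2)) / len(bin1)

print(f"correlation of raw variables: {r_raw:.3f}")
print(f"share of rows where both variables fall in the same bin: {same_bin:.3f}")
```

If the binned variables were unrelated, about a quarter of the rows would land in matching bins. In a setup like this one the share is far higher, so the class-variable dummies for var1 and var2 remain strongly related to each other.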
The problem of collinearity between predictor variables is not one that can be "solved": the collinearity exists in the data, and you will not be able to understand or analyze the data as if it did not. All algorithms that you might try will be affected by this collinearity.
In ordinary least squares regression, collinearity causes the estimates of the slopes and intercept to have much higher root mean square errors, so high in fact that a term in the model could have the wrong sign. I haven't seen a study about what happens when you have collinearity in logistic regression, but I would expect similar things to happen there. Thus, the question really is NOT how to eliminate or reduce the multicollinearity, but which methods produce the lowest root mean square errors of the slopes and intercept. According to a paper by Frank and Friedman, the algorithm that produces the lowest root mean square errors (in most cases) is Partial Least Squares, so that would be your best choice in the case of collinearity (it is PROC PLS in SAS). There is a logistic version of Partial Least Squares, here: https://cedric.cnam.fr/fichiers/RC906.pdf
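For the ordinary least squares case, the size of this inflation can be made concrete with the variance inflation factor (VIF). With exactly two predictors whose correlation is r, the sampling variance of each slope is multiplied by 1 / (1 - r^2) compared with uncorrelated predictors. The sketch below, in Python rather than SAS, just evaluates that formula. The function name `vif_two_predictors` is hypothetical, and the result applies to least squares; for logistic regression it should be read only as a rough guide.

```python
def vif_two_predictors(r):
    """Variance inflation factor for a model with exactly two predictors
    whose pairwise correlation is r (ordinary least squares setting)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)

for r in (0.0, 0.5, 0.9, 0.99):
    vif = vif_two_predictors(r)
    # The standard error of a slope grows with the square root of the VIF.
    print(f"r = {r:.2f}: variance x {vif:.2f}, standard error x {vif ** 0.5:.2f}")
```

At r = 0.9 the slope variance is roughly five times larger, and at r = 0.99 roughly fifty times larger, which is how an estimate can end up with the wrong sign.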