I am getting the below error while attempting to check for collinearity using PROC REG (as discussed here: Collinearity Diagnostics)
ERROR: Eigenvalues failed in collinear option.
I have 387 numeric independent variables in total. When I use only a subset of them, I do not get that error. Any advice on how to check for collinearity among all of those variables?
Furthermore, I will be modeling all variables that pass my collinearity check against a dependent variable with values 0 and 1 using PROC LOGISTIC, and I'm concerned I will have a similar problem if PROC REG has a variable limit.
Thanks,
-Lee
The variable limit doesn't come from PROC REG; it comes from the concept of regression itself. If you have more unknowns than data points, you can't solve the equations for the parameter estimates.
How many observations does your data have?
My current dataset is about 21K observations. However, I am debugging the code right now; I plan to run on a much larger dataset, maybe 2-2.5M observations.
Perhaps look into PROC VARCLUS?
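For anyone following up on this suggestion, here is a minimal sketch of what a PROC VARCLUS run might look like. The data set name work.train and the variable names x1-x387 are hypothetical placeholders, and the MAXEIGEN= threshold is an arbitrary starting value, not a recommendation.

```sas
/* Sketch only: work.train and x1-x387 are assumed names, not from the thread. */
/* VARCLUS splits the predictors into clusters of mutually correlated variables. */
proc varclus data=work.train
             maxeigen=0.7   /* keep splitting while a cluster's 2nd eigenvalue exceeds this (assumed value) */
             short          /* suppress some of the lengthier output tables */
             hi;            /* hierarchical clustering: variables do not switch clusters after a split */
   var x1-x387;
run;
```

A common way to use the output is to keep one representative variable per cluster, for example the one with the lowest 1-R**2 ratio, and carry only those forward into the model.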
Just to close out this thread, the short answer to my question is to use more observations. When I ran PROC REG with my 2M+ dataset, I did not get the error and was able to identify/eliminate collinear variables.
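For reference, the kind of PROC REG collinearity check described above could look roughly like the sketch below. The original code was not posted, so the data set name work.train, the response y, and the predictors x1-x387 are assumptions.

```sas
/* Sketch only: work.train, y, and x1-x387 are assumed names. */
/* The diagnostics describe the X variables; the 0/1 response is only needed */
/* because the MODEL statement requires a dependent variable.               */
proc reg data=work.train plots=none;
   model y = x1-x387 / vif      /* variance inflation factors */
                       tol      /* tolerances (1 / VIF) */
                       collin;  /* eigenvalues, condition indices, variance proportions */
run;
quit;
```

Large VIF values or large condition indices point to the variables involved in near-linear dependencies.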
I guess that "works" if you have more observations you can use, which isn't the case for everyone. It "works" only in the sense that SAS can now do the mathematical calculations and you don't get the error; it doesn't work in the sense of giving you good estimates or predictions.
The problem with 387 predictor variables remains. These 387 predictor variables are still partially correlated with one another, possibly highly correlated, and this causes regression to produce predictions and parameter estimates with very high mean square error, meaning that they are probably not good predictions and estimates. In that case, partial least squares provides better estimates and predictions (lower mean square error, often dramatically lower).
400 variables shouldn't present a problem for SAS regression procedures. I can't imagine why the eigenvalue computation is failing. I've never seen that error message before, but I don't think you should ignore it. It is probably telling you something important about your variables. I'd try to determine what variables are collinear. Maybe look into PROC CORR or PROC PRINCOMP?
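One possible way to act on the PROC PRINCOMP idea is sketched below: compute the eigenvalues of the predictors' correlation matrix and list the components whose eigenvalues are close to zero, since those correspond to (near-)exact linear dependencies. The names work.train and x1-x387 and the 1e-6 cutoff are assumptions chosen for illustration.

```sas
/* Sketch only: work.train, x1-x387, and the cutoff are assumed, not from the thread. */
proc princomp data=work.train outstat=work.pc_stats noprint;
   var x1-x387;
run;

/* In the OUTSTAT= data set, the row with _TYPE_='EIGENVAL' holds the eigenvalues, */
/* stored in order across the analysis variables.                                  */
data work.small_eigen;
   set work.pc_stats(where=(_type_ = 'EIGENVAL'));
   array ev {*} x1-x387;
   do component = 1 to dim(ev);
      eigenvalue = ev{component};
      /* near-zero eigenvalue = near-exact linear dependency among predictors */
      if not missing(eigenvalue) and eigenvalue < 1e-6 then output;
   end;
   keep component eigenvalue;
run;

proc print data=work.small_eigen;
run;
```

For each component listed, the matching eigenvector (the _TYPE_='SCORE' rows in the same data set) shows which variables carry large coefficients and are therefore involved in the dependency.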
While the "hard" limits have been discussed already, there are practical limits (or perhaps I should say drawbacks) to using 387 predictor variables in a model. Even if they do not show exact collinearity, they may show partial collinearity; in other words, some of the X-variables are highly (but not perfectly) correlated with other X-variables. In that case, regression is a poor choice for fitting the model and making predictions. Partial least squares is a better method in the sense that it has been shown to produce model predictions and parameter estimates with lower mean squared error than you would get from regression. Also, if you use partial least squares, the issue of exact collinearity goes away: PLS doesn't care if multiple variables show exact collinearity.
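As a rough illustration of the partial least squares approach, a PROC PLS call might look like the sketch below. The data set work.train, the response y, the predictors x1-x387, the output names, and the cross-validation settings are all assumptions. Note also that PROC PLS treats the 0/1 response as a continuous variable, so the predicted values are not bounded probabilities the way PROC LOGISTIC output would be.

```sas
/* Sketch only: all data set and variable names here are assumed. */
proc pls data=work.train
         method=pls
         cv=split(10)          /* split-sample cross validation, every 10th observation held out */
         cvtest(seed=12345);   /* test-based choice of the number of factors; seed is arbitrary */
   model y = x1-x387;
   output out=work.pls_scored predicted=y_hat;   /* y_hat is a hypothetical output variable name */
run;
```

Cross validation is used here to pick the number of extracted factors rather than fixing it by hand.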