sasnewbie12
Obsidian | Level 7

I have run a huge logistic regression with about 900 independent variables in the model. All variables in the model, including the dependent variable, are binary (1 or 0).

 

The log states that:

WARNING: The information matrix is singular and thus the convergence is questionable.

 

while also stating that:

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

 

I am just using this model to identify potentially significant independent factors that predict the dependent outcome; I will then use those significant variables in further modeling with other covariates not included in this model.

 

Therefore, do I have to fix this issue by potentially removing variables (perhaps there is collinearity?), or can I rely on the model?

It would be difficult to try and pick and choose from 900 variables. 

7 REPLIES
PaigeMiller
Diamond | Level 26

sasnewbie12 wrote:

 

Therefore, do I have to fix this issue by potentially removing variables (perhaps there is collinearity?), or can I rely on the model?

It would be difficult to try and pick and choose from 900 variables. 


Not "perhaps". There is collinearity. As in, one (or more) of the 900 variables is a perfect linear combination of the others.
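
To make the "perfect linear combination" point concrete, here is a minimal sketch in Python (not SAS) that flags which columns of a small 0/1 design matrix are exact linear combinations of earlier columns plus the intercept. The function name `dependent_columns` and the toy data are hypothetical, invented for illustration. Exact fraction arithmetic is used only because the example is tiny; with 900 columns a numerical rank method would be more practical.

```python
from fractions import Fraction


def dependent_columns(rows):
    """Return 0-based indices of predictor columns that are exact linear
    combinations of the intercept and the predictor columns before them.

    `rows` is a non-empty list of equal-length lists of 0/1 values
    (hypothetical toy input, one inner list per observation).
    """
    # Prepend a column of ones, as a model with an intercept would.
    matrix = [[Fraction(1)] + [Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(matrix), len(matrix[0])
    pivot_row = 0
    dependent = []
    for col in range(n_cols):
        # Look for a nonzero entry in this column at or below the pivot row.
        pivot = next(
            (r for r in range(pivot_row, n_rows) if matrix[r][col] != 0), None
        )
        if pivot is None:
            # No pivot: this column lies in the span of the earlier columns.
            if col > 0:
                dependent.append(col - 1)  # shift to skip the intercept
            continue
        matrix[pivot_row], matrix[pivot] = matrix[pivot], matrix[pivot_row]
        # Eliminate the entries below the pivot (row-echelon form).
        for r in range(pivot_row + 1, n_rows):
            factor = matrix[r][col] / matrix[pivot_row][col]
            if factor != 0:
                matrix[r] = [
                    a - factor * b for a, b in zip(matrix[r], matrix[pivot_row])
                ]
        pivot_row += 1
    return dependent


# Toy data: columns are x1, x2, x3, x4 where x3 = 1 - x1 and x4 = x2.
toy_rows = [
    [1, 0, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
print(dependent_columns(toy_rows))  # [2, 3], i.e. x3 and x4
```

Here x3 is the complement of x1 (so x1 + x3 equals the intercept column) and x4 duplicates x2. Either situation is enough to make the information matrix singular, and both are easy to produce by accident when many binary indicators come from the same survey items.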

 

I wouldn't do this. Even if you can trust the model (which you probably can't), logistic regression is a poor choice of technique when you have 900 correlated variables.

 

You would be better off using a technique that is much less affected by the presence of collinearity. One such method is Partial Least Squares regression, which in SAS is PROC PLS.

--
Paige Miller
sasnewbie12
Obsidian | Level 7

This is also survey data. I don't think there is any proc for PLS with survey data.

PaigeMiller
Diamond | Level 26

That doesn't change any of my comments. Logistic regression in this case is a nightmare. The collinearity will make your results meaningless.

 

You could modify the data to weight things as the survey requires, and then run PROC PLS.

--
Paige Miller
Ksharp
Super User
Since you have a huge number of variables for logistic regression,
I suggest you use PROC HPGENSELECT to select the dozen or so most significant variables.

PaigeMiller
Diamond | Level 26

In my opinion, HPGENSELECT fails for the same reason as LOGISTIC: it is not meant to account for the collinearity of the 900 variables. Forward and stepwise methods are widely regarded by the statistical community as having major drawbacks.

--
Paige Miller
Ksharp
Super User
There are other selection methods in PROC HPGENSELECT, such as LASSO, CV, and so on.


PaigeMiller
Diamond | Level 26

With regard to Lasso, there is this long thread in which many people argue that Lasso is not a good choice with a large number of correlated variables. https://stats.stackexchange.com/questions/7935/what-are-disadvantages-of-using-the-lasso-for-variabl...

 

I don't know enough about CV to comment.

--
Paige Miller
