I am running proc logistic in SAS with around 3.6 million observations, an outcome with 5 levels, and dozens of categorical predictors. I had no issue running both univariate and multivariate models with param = ref. However, once I tried param = glm, the multivariate models started giving the warning "The information matrix is singular and thus the convergence is questionable. specifying a larger SINGULAR= value." After some research, I found that this message suggests a multicollinearity issue in the model. I then tried using only 2 predictors, and it still gave the message, even though the correlation matrix showed no correlation between the two predictors.
As far as I know, the only difference between param = ref and param = glm is that param = glm uses less-than-full-rank coding: it creates k dummy variables for a categorical predictor with k levels, whereas param = ref creates k-1. The two parameterizations should produce the same log-likelihood and estimates given the same reference level. To confirm this, I compared the results of the two models using only 2 predictors. Although param = glm throws a warning, its results are identical to those of param = ref (except for a bunch of zeros in the estimates for each predictor's reference level under param = glm; is that the cause?).
My question is: why did the param = glm model throw a warning while the param = ref model did not? And, more importantly, in this situation, should I trust the results of the param = ref model, given that it displayed no warning?
I appreciate any advice and suggestions. Thank you in advance.
Any change to the model, such as the parameterization of CLASS effects, changes the optimization, so unexpected differences like this can happen. To assess convergence of the PARAM=REF fit, you can add the ITPRINT option and examine the gradient vector at the final iteration. For proper convergence, all of its elements should be quite close to zero. Also examine the standard errors of the parameter estimates: they should not be large (approaching 100 or more). If you want more assurance, you could refit the model with any of the other procedures that can fit a logistic model, such as GLIMMIX, GENMOD, HPGENSELECT, or PROBIT, which generally do not share identical algorithm code.
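As a rough sketch of the two checks described above, something like the following could be used. The data set name (mydata), response (y), and predictors (x1, x2) are placeholders, not names from the original post, and the sketch assumes the default cumulative logit link; a nominal outcome fitted with LINK=GLOGIT would need a cross-check procedure that supports that link.

```sas
/* Hypothetical names: mydata, y, x1, x2 are placeholders. */

/* Check 1: print the iteration history, including the gradient */
/* vector, and inspect the last iteration and the standard      */
/* errors in the parameter estimates table.                     */
proc logistic data=mydata;
   class x1 x2 / param=ref;
   model y = x1 x2 / itprint;
run;

/* Check 2: refit the same cumulative logit model with a        */
/* different procedure and compare the log likelihood and       */
/* estimates. GENMOD's CLASS statement uses GLM-style coding    */
/* by default.                                                  */
proc genmod data=mydata;
   class x1 x2;
   model y = x1 x2 / dist=multinomial link=cumlogit;
run;
```

If both fits converged properly, the log likelihoods and the non-reference estimates would be expected to agree closely; a noticeable disagreement is a reason to look further.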
Thank you for the reply! I've used the ITPRINT option, and the gradients seemed to approach zero at the end. I'll try other options you suggested.
You could try different reference levels of the Y variable to check which one triggers this annoying WARNING message.
model Smoker_NXT(ref='No') = AgeStartCIGS Age1stIview Sex Race Hispanic Wave|Smoker|ENDSer SmkHistory Start2SMK / noint link=glogit Singular=1E-7 ;
model Smoker_NXT(ref='Yes') = AgeStartCIGS Age1stIview Sex Race Hispanic Wave|Smoker|ENDSer SmkHistory Start2SMK / noint link=glogit Singular=1E-7 ;
model Smoker_NXT(ref='None') = AgeStartCIGS Age1stIview Sex Race Hispanic Wave|Smoker|ENDSer SmkHistory Start2SMK / noint link=glogit Singular=1E-7 ;
......................
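One way to run through the reference levels without copying the MODEL statement by hand is a small macro. This is only a sketch: the macro name (try_ref), data set (mydata), response (y), predictors (x1, x2), and level names (A, B, C) are all hypothetical and would need to be replaced with the real ones.

```sas
/* Hypothetical names throughout: try_ref, mydata, y, x1, x2, A/B/C. */
%macro try_ref(level);
   title "Reference level of y: &level";
   proc logistic data=mydata;
      class x1 x2 / param=glm;
      /* Double quotes so that &level resolves inside the string. */
      model y(ref="&level") = x1 x2 / link=glogit;
   run;
   title;
%mend try_ref;

/* One call per response level; check the log after each run. */
%try_ref(A)
%try_ref(B)
%try_ref(C)
```

The log can then be searched for the singularity warning to see which reference levels, if any, produce it.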
So the reference level also makes a difference? I will definitely try it. Thank you.