I have a hierarchically structured data set with count outcomes across approximately 75,000 records (nested within about 68 level-2 units). I've utilized completely unconditional models to explore model fit, looking at poisson, negative binomial, and the zero-inflated flavors of both of these. Based on model fit values, graphical depictions, etc., the data appear to be zero-inflated negative binomial. Now, of course, i would like to proceed exploring related predictors of both the zero-inflated portion and the negative binomial processes.
I have about 10 predictors that i want to explore, based on both theory and use of a multi-level logistic and negative binomial models via glimmix to establish ballpark starting values. I submitted a model with these predictors, including starting values, a few days ago and it's still churning. Am i asking too much (i.e. the model is too complex) given the size of the data? Is there any way i can speed things up?
Have you looked at the SAS-L archives for NLMIXED posts, especially those by Dale McLerran, Matt Zack and David Cassell?
I am going to guess that reparameterization so that boundary effects are not as noticeable is what needs to be done. Also worrisome would be some sort of quasi-separation for the probability of a zero across the level-2 units. If that happens, well, there is the R package glmer(), but I don't think you would be happy with that either, as I don't think you will be able to load everything into RAM, given the data and the possible size of the matrixes involved.
So, if reparameterizing doesn't work, then you might have to go for a Bayesian approach that uses Monte Carlo methods. What is the parameterization you are currently using, if you don't mind sharing?
Steve Denham
Here's the code i've been using...any insights for improvements are greatly appreciated:
proc nlmixed data=mtest.&fn method=gauss qpoints=50;
parms a0=1.6549
asex=-.7043
aasian=.5559
ablack=-1.3403
ahisp=-.7020
aamind=-1.0654
amulti=-.8059
aec=-.4590
alep=.2021
aoverage=-.4825
usiga=.7648
mu_a0=0
b0=-6.8186
bsex=.9558
basian=-.5166
bblack=1.5365
bhisp=.7507
bamind=1.2369
bmulti=.9586
bec=.5113
blep=-.2454
boverage=.7596
mu_b0=0
usigb=.6051
cov_uab=0
k=1;
zinb_p=min(max(1/(1 + exp(-(a0 + u0a))), 1E-200), 1-1E-200);
eta_nb=b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + u0b + ln;
etaxB=b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + ln;
etaxBzG=b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + u0b + ln;
lambda = exp(eta_nb);
p0=zinb_p+(1-zinb_p)*exp(-(numdaysoss+(1/k))*log(1+k*lambda));
p_else=(1-zinb_p)*exp(lgamma(numdaysoss+(1/k)) - lgamma(numdaysoss+1) - lgamma(1/k) +numdaysoss*log(k*lambda) - (numdaysoss+(1/k))*log(1+k*lambda));
if numdaysoss=0 then loglike = log(p0);
else loglike = log(p_else);
model numdaysoss ~ general(loglike);
random u0a u0b ~ normal([mu_a0, mu_b0], [usiga, cov_uab, usigb]) subject=school_id out=rand_zinb (keep=school_id effect estimate);
run;
There is one parameter I have a question about--ln, appearing in the three eta definitions. It doesn't have a starting value in the parms statement, and so will start out equal to 1. Is that a reasonable value? What does it estimate? Probably won't help the convergence problem much, but...
Also what are all of the "a" parameters used for? It appears that only a0 appears in any of the expressions (defining zinp_b). You may need to include these in the zinp_b definition to match what your source data came up with. Again, not real sure that it will help the convergence.
And while I don't have ANY experience with it, the OPTCHECK option on the NLMIXED statement might help move things along, but I am guessing about that, based solely on how it is described in the documentation.
Steve Denham
Sorry, the 'a' variables should all be included on the zinb_p definition. These model the zero inflation portion. The ln variable is the offset, or the log of the days enrolled in school (here in my example), as some subjects have differing number of days that they can be suspended. I'll look into the optcheck feature.
If it is an offset, ln(median days suspended) is a good initial value. Just don't let it overload the exponentiation. Since all of the other parameters sum to about -1, I don't think this is the problem. I think the optimization is having a hard time with some schools having a much higher rate of zero inflation than others. If this were a strictly fixed effects situation, I might suggest finite mixture models, mixing a negative binomial with a constant, as in a hurdle model. But the standard errors won't work--everything will be dominated by the schools with a greater number of observations.
Steve Denham
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.