jsberger
Obsidian | Level 7

I have a hierarchically structured data set with count outcomes across approximately 75,000 records (nested within about 68 level-2 units). I've used completely unconditional models to explore model fit, looking at Poisson, negative binomial, and the zero-inflated flavors of both. Based on fit statistics, graphical checks, etc., the data appear to be zero-inflated negative binomial. Now I would like to explore predictors of both the zero-inflation portion and the negative binomial process.

I have about 10 predictors that I want to explore, chosen from theory and from multilevel logistic and negative binomial models fit in GLIMMIX to establish ballpark starting values (roughly along the lines of the sketch below). I submitted a model with these predictors, including the starting values, a few days ago and it's still churning. Am I asking too much (i.e., is the model too complex) given the size of the data? Is there any way I can speed things up?
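For reference, the GLIMMIX runs I mean were roughly along these lines. The indicator names sex, asian, black, hisp, amind, multi, ec, lep, and overage are placeholders for the actual variables, and zeroflag would be a 0/1 indicator for numdaysoss=0 built in a prior data step:

   /* NB piece: ballpark values for b0, bsex, ..., usigb */
   proc glimmix data=mtest.&fn method=laplace;
      class school_id;
      model numdaysoss = sex asian black hisp amind multi ec lep overage
            / dist=negbin link=log offset=ln solution;
      random intercept / subject=school_id;
   run;

   /* logistic piece on the zero indicator: ballpark values for a0, asex, ..., usiga */
   proc glimmix data=mtest.&fn method=laplace;
      class school_id;
      model zeroflag(event='1') = sex asian black hisp amind multi ec lep overage
            / dist=binary link=logit solution;
      random intercept / subject=school_id;
   run;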

SteveDenham
Jade | Level 19

Have you looked at the SAS-L archives for NLMIXED posts, especially those by Dale McLerran, Matt Zack and David Cassell?

I am going to guess that reparameterizing so that boundary effects are less noticeable is what needs to be done.  Also worrisome would be some sort of quasi-separation in the probability of a zero across the level-2 units.  If that happens, well, there is glmer() in the R package lme4, but I don't think you would be happy with that either, as I don't think you will be able to load everything into RAM, given the data and the possible size of the matrices involved.
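On the reparameterization point, one trick that often helps in NLMIXED is to estimate the NB dispersion and the random-effect standard deviation on the log scale and back-transform inside the program, so the optimizer never has to sit on the k > 0 or variance > 0 boundary. A generic sketch (data set, response, and parameter names here are purely illustrative, not taken from your model):

   proc nlmixed data=mydata;                      /* illustrative names throughout */
      parms b0=0 logk=0 logsd=0;                  /* logk=0, logsd=0  =>  k=1, sd=1 */
      k      = exp(logk);                         /* dispersion strictly positive   */
      lambda = exp(b0 + u);                       /* NB mean with a random intercept */
      /* negative binomial log-likelihood written directly */
      loglike = lgamma(y + 1/k) - lgamma(y + 1) - lgamma(1/k)
                + y*log(k*lambda) - (y + 1/k)*log(1 + k*lambda);
      model y ~ general(loglike);
      random u ~ normal(0, exp(2*logsd)) subject=id;   /* variance can never go negative */
   run;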

So, if reparameterizing doesn't work, then you might have to go for a Bayesian approach that uses Monte Carlo methods.  What is the parameterization you are currently using, if you don't mind sharing?

Steve Denham

jsberger
Obsidian | Level 7

Here's the code I've been using... any insights for improvements are greatly appreciated:

proc nlmixed data=mtest.&fn method=gauss qpoints=50;
   parms a0=1.6549     asex=-0.7043     aasian=0.5559    ablack=-1.3403
         ahisp=-0.7020  aamind=-1.0654   amulti=-0.8059   aec=-0.4590
         alep=0.2021    aoverage=-0.4825 usiga=0.7648     mu_a0=0
         b0=-6.8186     bsex=0.9558      basian=-0.5166   bblack=1.5365
         bhisp=0.7507   bamind=1.2369    bmulti=0.9586    bec=0.5113
         blep=-0.2454   boverage=0.7596  mu_b0=0          usigb=0.6051
         cov_uab=0      k=1;

   /* zero-inflation probability (logit of intercept plus school random effect),
      clamped away from 0 and 1 */
   zinb_p = min(max(1/(1 + exp(-(a0 + u0a))), 1E-200), 1 - 1E-200);

   /* negative binomial linear predictors; ln is the offset */
   eta_nb  = b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + u0b + ln;
   etaxB   = b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + ln;
   etaxBzG = b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + u0b + ln;

   lambda = exp(eta_nb);

   /* mixture probabilities: p0 for a zero count, p_else for positive counts */
   p0     = zinb_p + (1 - zinb_p)*exp(-(numdaysoss + (1/k))*log(1 + k*lambda));
   p_else = (1 - zinb_p)*exp(lgamma(numdaysoss + (1/k)) - lgamma(numdaysoss + 1) - lgamma(1/k)
                              + numdaysoss*log(k*lambda) - (numdaysoss + (1/k))*log(1 + k*lambda));

   if numdaysoss = 0 then loglike = log(p0);
   else loglike = log(p_else);

   model numdaysoss ~ general(loglike);
   random u0a u0b ~ normal([mu_a0, mu_b0], [usiga, cov_uab, usigb])
          subject=school_id out=rand_zinb(keep=school_id effect estimate);
run;

SteveDenham
Jade | Level 19

There is one parameter I have a question about: ln, which appears in the three eta definitions.  It doesn't have a starting value in the PARMS statement, so it will start out equal to 1.  Is that a reasonable value?  What does it estimate?  Probably won't help the convergence problem much, but...

Also, what are all of the "a" parameters used for?  It appears that only a0 shows up in any of the expressions (the one defining zinb_p).  You may need to include them in the zinb_p definition to match what your source data came up with.  Again, not really sure that it will help the convergence.

And while I don't have ANY experience with it, the OPTCHECK option on the NLMIXED statement might help move things along, but I am guessing about that, based solely on how it is described in the documentation.

Steve Denham

jsberger
Obsidian | Level 7

Sorry, the 'a' variables should all be included in the zinb_p definition; these model the zero-inflation portion.  The ln variable is the offset, i.e., the log of the number of days enrolled in school (in my example), since subjects differ in the number of days on which they can be suspended.  I'll look into the OPTCHECK option.
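Something like this is what I have in mind for that piece, written with explicit coefficient*indicator products (the indicator names are placeholders for my actual variables; ln stays in the count part only, and since it is a data set variable it needs no entry in the PARMS statement):

   /* zero-inflation linear predictor with all of the 'a' covariates included */
   eta_zi = a0 + asex*sex + aasian*asian + ablack*black + ahisp*hisp
               + aamind*amind + amulti*multi + aec*ec + alep*lep
               + aoverage*overage + u0a;
   zinb_p = min(max(1/(1 + exp(-eta_zi)), 1E-200), 1 - 1E-200);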

SteveDenham
Jade | Level 19

If it is an offset, ln(median days suspended) is a good initial value.  Just don't let it overload the exponentiation.  Since all of the other parameters sum to about -1, I don't think this is the problem.  I think the optimization is having a hard time with some schools having a much higher rate of zero inflation than others.  If this were a strictly fixed-effects situation, I might suggest finite mixture models, mixing a negative binomial with a constant, as in a hurdle model.  But the standard errors won't work; everything will be dominated by the schools with a greater number of observations.
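For what it's worth, that finite-mixture idea would look roughly like the following in PROC FMM. The covariate names are the same placeholders as before, and note that it drops the school-level random intercepts entirely:

   /* ZINB as a two-component finite mixture: an NB regression mixed with a
      point mass at zero (dist=constant); PROBMODEL puts the covariates on
      the mixing probability */
   proc fmm data=mtest.&fn;
      model numdaysoss = sex asian black hisp amind multi ec lep overage
            / dist=negbin offset=ln;
      model + / dist=constant;
      probmodel sex asian black hisp amind multi ec lep overage;
   run;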

Steve Denham
