jsberger
Obsidian | Level 7

I have a hierarchically structured data set with count outcomes across approximately 75,000 records (nested within about 68 level-2 units). I've used completely unconditional models to explore model fit, looking at Poisson, negative binomial, and the zero-inflated flavors of both. Based on fit statistics, graphical checks, etc., the data appear to be zero-inflated negative binomial. Now I would like to explore predictors of both the zero-inflation portion and the negative binomial process.

I have about 10 predictors that I want to explore, chosen from theory and from multilevel logistic and negative binomial models fit in GLIMMIX to establish ballpark starting values (roughly along the lines of the sketch below). I submitted a model with these predictors, including the starting values, a few days ago and it's still churning. Am I asking too much (i.e., is the model too complex) given the size of the data? Is there any way I can speed things up?
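For reference, the GLIMMIX runs I mean were roughly along these lines. The indicator names sex, asian, black, hisp, amind, multi, ec, lep, and overage are placeholders for the actual variables, and zeroflag would be a 0/1 indicator for numdaysoss=0 built in a prior data step:

   /* NB piece: ballpark values for b0, bsex, ..., usigb */
   proc glimmix data=mtest.&fn method=laplace;
      class school_id;
      model numdaysoss = sex asian black hisp amind multi ec lep overage
            / dist=negbin link=log offset=ln solution;
      random intercept / subject=school_id;
   run;

   /* logistic piece on the zero indicator: ballpark values for a0, asex, ..., usiga */
   proc glimmix data=mtest.&fn method=laplace;
      class school_id;
      model zeroflag(event='1') = sex asian black hisp amind multi ec lep overage
            / dist=binary link=logit solution;
      random intercept / subject=school_id;
   run;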

SteveDenham
Jade | Level 19

Have you looked at the SAS-L archives for NLMIXED posts, especially those by Dale McLerran, Matt Zack and David Cassell?

I am going to guess that reparameterizing so that boundary effects are less noticeable is what needs to be done.  Also worrisome would be some sort of quasi-separation in the probability of a zero across the level-2 units.  If that happens, well, there is glmer() in the R package lme4, but I don't think you would be happy with that either, as I don't think you will be able to load everything into RAM, given the data and the possible size of the matrices involved.
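On the reparameterization point, one trick that often helps in NLMIXED is to estimate the NB dispersion and the random-effect standard deviation on the log scale and back-transform inside the program, so the optimizer never has to sit on the k > 0 or variance > 0 boundary. A generic sketch (data set, response, and parameter names here are purely illustrative, not taken from your model):

   proc nlmixed data=mydata;                      /* illustrative names throughout */
      parms b0=0 logk=0 logsd=0;                  /* logk=0, logsd=0  =>  k=1, sd=1 */
      k      = exp(logk);                         /* dispersion strictly positive   */
      lambda = exp(b0 + u);                       /* NB mean with a random intercept */
      /* negative binomial log-likelihood written directly */
      loglike = lgamma(y + 1/k) - lgamma(y + 1) - lgamma(1/k)
                + y*log(k*lambda) - (y + 1/k)*log(1 + k*lambda);
      model y ~ general(loglike);
      random u ~ normal(0, exp(2*logsd)) subject=id;   /* variance can never go negative */
   run;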

So, if reparameterizing doesn't work, then you might have to go for a Bayesian approach that uses Monte Carlo methods.  What is the parameterization you are currently using, if you don't mind sharing?

Steve Denham

jsberger
Obsidian | Level 7

Here's the code I've been using... any insights for improvements are greatly appreciated:

proc nlmixed data=mtest.&fn method=gauss qpoints=50;
   parms a0=1.6549     asex=-0.7043     aasian=0.5559    ablack=-1.3403
         ahisp=-0.7020  aamind=-1.0654   amulti=-0.8059   aec=-0.4590
         alep=0.2021    aoverage=-0.4825 usiga=0.7648     mu_a0=0
         b0=-6.8186     bsex=0.9558      basian=-0.5166   bblack=1.5365
         bhisp=0.7507   bamind=1.2369    bmulti=0.9586    bec=0.5113
         blep=-0.2454   boverage=0.7596  mu_b0=0          usigb=0.6051
         cov_uab=0      k=1;

   /* zero-inflation probability (logit of intercept plus school random effect),
      clamped away from 0 and 1 */
   zinb_p = min(max(1/(1 + exp(-(a0 + u0a))), 1E-200), 1 - 1E-200);

   /* negative binomial linear predictors; ln is the offset */
   eta_nb  = b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + u0b + ln;
   etaxB   = b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + ln;
   etaxBzG = b0 + bsex + basian + bblack + bhisp + bamind + bmulti + bec + blep + boverage + u0b + ln;

   lambda = exp(eta_nb);

   /* mixture probabilities: p0 for a zero count, p_else for positive counts */
   p0     = zinb_p + (1 - zinb_p)*exp(-(numdaysoss + (1/k))*log(1 + k*lambda));
   p_else = (1 - zinb_p)*exp(lgamma(numdaysoss + (1/k)) - lgamma(numdaysoss + 1) - lgamma(1/k)
                              + numdaysoss*log(k*lambda) - (numdaysoss + (1/k))*log(1 + k*lambda));

   if numdaysoss = 0 then loglike = log(p0);
   else loglike = log(p_else);

   model numdaysoss ~ general(loglike);
   random u0a u0b ~ normal([mu_a0, mu_b0], [usiga, cov_uab, usigb])
          subject=school_id out=rand_zinb(keep=school_id effect estimate);
run;

SteveDenham
Jade | Level 19

There is one parameter I have a question about: ln, which appears in the three eta definitions.  It doesn't have a starting value in the PARMS statement, so it will start out equal to 1.  Is that a reasonable value?  What does it estimate?  Probably won't help the convergence problem much, but...

Also, what are all of the "a" parameters used for?  It appears that only a0 shows up in any of the expressions (the one defining zinb_p).  You may need to include them in the zinb_p definition to match what your source data came up with.  Again, not really sure that it will help the convergence.

And while I don't have ANY experience with it, the OPTCHECK option on the NLMIXED statement might help move things along, but I am guessing about that, based solely on how it is described in the documentation.

Steve Denham

jsberger
Obsidian | Level 7

Sorry, the 'a' variables should all be included in the zinb_p definition; these model the zero-inflation portion.  The ln variable is the offset, i.e., the log of the number of days enrolled in school (in my example), since subjects differ in the number of days on which they can be suspended.  I'll look into the OPTCHECK option.
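Something like this is what I have in mind for that piece, written with explicit coefficient*indicator products (the indicator names are placeholders for my actual variables; ln stays in the count part only, and since it is a data set variable it needs no entry in the PARMS statement):

   /* zero-inflation linear predictor with all of the 'a' covariates included */
   eta_zi = a0 + asex*sex + aasian*asian + ablack*black + ahisp*hisp
               + aamind*amind + amulti*multi + aec*ec + alep*lep
               + aoverage*overage + u0a;
   zinb_p = min(max(1/(1 + exp(-eta_zi)), 1E-200), 1 - 1E-200);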

SteveDenham
Jade | Level 19

If it is an offset, ln(median days suspended) is a good initial value.  Just don't let it overload the exponentiation.  Since all of the other parameters sum to about -1, I don't think this is the problem.  I think the optimization is having a hard time with some schools having a much higher rate of zero inflation than others.  If this were a strictly fixed-effects situation, I might suggest finite mixture models, mixing a negative binomial with a constant, as in a hurdle model.  But the standard errors won't work; everything will be dominated by the schools with a greater number of observations.
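For what it's worth, that finite-mixture idea would look roughly like the following in PROC FMM. The covariate names are the same placeholders as before, and note that it drops the school-level random intercepts entirely:

   /* ZINB as a two-component finite mixture: an NB regression mixed with a
      point mass at zero (dist=constant); PROBMODEL puts the covariates on
      the mixing probability */
   proc fmm data=mtest.&fn;
      model numdaysoss = sex asian black hisp amind multi ec lep overage
            / dist=negbin offset=ln;
      model + / dist=constant;
      probmodel sex asian black hisp amind multi ec lep overage;
   run;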

Steve Denham
