Kastchei
Pyrite | Level 9

Hi all,

 

I have a multinomial regression with a random intercept that I'm modeling with GLIMMIX.  I have run it with method = MMPL, Laplace, and Quad.  All converge, though Quad took nearly 12 hours.  Laplace and Quad both give a warning that the maximum element of the gradient is > 0.001.  The maximum gradients for the three methods are 0.000003969 (MMPL), 0.003263 (Laplace), and 0.13639 (Quad).  The results are attached.

 

The ORs for Laplace are slightly further from 1 than those for MMPL, and the Quad ORs are much further still.  In the past, I've always preferred to use quad when possible, but perhaps in this case I should be using MMPL or Laplace.  How important is the maximum gradient element?
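
For context, here is a minimal sketch of the kind of call I'm comparing (the data set and variable names are placeholders, not my actual study variables); the GRADIENT and ITDETAILS options are what print the gradient information I'm referring to:

proc glimmix data = myData method = laplace gradient itDetails;   /* also run with method = MMPL and method = quad */
    class id outcome trt;
    model outcome = trt / distribution = multinomial link = gLogit solution oddsRatio;
    random intercept / subject = id group = outcome;
run;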

 

Thanks for any insight.

 

Warm regards,

Michael

8 REPLIES
SteveDenham
Jade | Level 19

The question of how troubling a max gradient large enough to warrant a message in the output really is turns out to be, well, troublesome.  It does mean that the estimates could still be improved, but also that you probably don't have enough data to move them far enough to change the value of the convergence criterion.  So maybe not so troubling.

 

On the other hand, you ought to look at the paper by Stroup and Claassen. This is behind a paywall, but is easily accessible for ASA members:

Stroup, W., Claassen, E. Pseudo-Likelihood or Quadrature? What We Thought We Knew, What We Think We Know, and What We Are Still Trying to Figure Out. JABES (2020). https://doi.org/10.1007/s13253-020-00402-6

 

The point they make is that unless you have a lot of replicates, the maximum likelihood estimates of the variance components are biased enough to lead to poor Type I error rates.  In light of that paper, what happens if you use the default RSPL (subject level REML by linearization)?
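
Concretely (placeholder names here, not your actual model), RSPL is simply what you get when you leave METHOD= off:

proc glimmix data = myData;   /* METHOD = RSPL is the default */
    class id outcome trt;
    model outcome = trt / distribution = multinomial link = gLogit solution oddsRatio;
    random intercept / subject = id group = outcome;
run;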

 

If your only three candidates are MMPL, Laplace, and Gaussian quadrature, I would pick quad, even though it has the largest remaining gradient and took so long.  But if RSPL "works", I would really be tempted to use those results and cite the Stroup and Claassen paper.

 

SteveDenham

Kastchei
Pyrite | Level 9

Thank you, Steve.  I always look forward to your responses.  I'll take a look at the paper.

 

I've tried all the methods.

 

RSPL and MSPL give the warning "WARNING: Pseudo-likelihood update fails in outer iteration 3" and stop.  In the iteration history, the objective function has blown up (1E154).  I tried every nlOptions tech = setting to see if that would change anything, but it did not.

 

RMPL and MMPL both work.  They give nearly identical results.

 

Laplace and quad both converge with no options, but the gradients are quite large.  With gConv = 0, they run a bit longer and give the gradients in the initial post.  The OR estimates are somewhat to considerably further from 1 than those from RMPL and MMPL.  Standard errors tend to be larger as well.  I'm currently running quadrature with 13 points rather than 9 (the adaptively determined value) to see if that makes any difference.
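
For reference, these are the kinds of options I mean (a sketch with placeholder names; my actual statements are posted further down in the thread):

proc glimmix data = myData method = quad(qPoints = 13) gradient itDetails;
    nlOptions gConv = 0 maxIter = 5000;   /* gConv = 0 disables the relative gradient convergence criterion */
    class id outcome trt;
    model outcome = trt / distribution = multinomial link = gLogit solution oddsRatio;
    random intercept / subject = id group = outcome;
run;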

 

Would you recommend going with RMPL?  I'm not sure how important the distinction between RSPL and RMPL is.

 

Warm regards,

Michael

SteveDenham
Jade | Level 19

Hi @Kastchei 

Given what I read in the paper, I suspect that RMPL would be superior to the other methods.  You nailed one of the two big problems with RSPL: the subject-level linearization leads to issues of calculation with the multinomial distribution.  So the marginal method, where the linearization is around the mean per group, sure seems like the way to go.  It shares the second issue with RSPL in that you can't get information criteria, because the final data are pseudo-data and thus depend on the selection of the covariance structure.
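
In rough notation (my shorthand for the usual pseudo-likelihood setup, not a quote from the documentation), both flavors fit a linear mixed model to a pseudo-response

$$ p = \hat{\eta} + \widehat{D}^{-1}\bigl(y - \mu(\hat{\eta})\bigr), \qquad \widehat{D} = \left.\frac{\partial \mu}{\partial \eta}\right|_{\hat{\eta}}, $$

where the subject-specific expansions (RSPL, MSPL) use \(\hat{\eta} = X\hat{\beta} + Z\hat{\gamma}\), while the marginal expansions (RMPL, MMPL) use \(\hat{\eta} = X\hat{\beta}\), i.e., they expand about \(\gamma = 0\).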

 

SteveDenham

Kastchei
Pyrite | Level 9

Hi Steve,

 

I read the paper.  Thank you for mentioning it.  I would like to make sure I'm understanding the paper correctly.

 

My data has a very large number of subjects (S > 16k), but a small and varying number of measurements per subject: N = 1 to 9, most being towards the smaller side.

# of measurements    # of subjects
        1                2,973
        2                2,667
        3                2,586
        4                2,771
        5                2,437
        6                1,739
        7                  891
        8                  274
        9                   42

From my reading of the paper, especially the comments on Figure 1, the small N would suggest f(y|b) is fairly skewed (granted they were looking at binomial rather than multinomial data).  In their suggestions (Table 9), I would think that places my study in the top right category [f(y|b) skewed, S large].  They suggest

"Use quadrature—but take care to match GLMM to data generating process to the extent possible—model misspecification matters"

However, this is a retrospective study rather than the kind of designed experiment the paper focuses on, so it's not really possible to match the data-generating process.  The paper does not explicitly give an alternative when the process cannot be matched, but I assume they would suggest using pseudo-likelihood (or would they say no approach is good?).  Am I understanding this correctly?

 

As a follow-up, it seems like the takeaway from the paper might be that integral approximation should be reserved for prospective studies with good experimental design, where the model can be correctly specified.

 

Also, would you mind taking a brief moment to explain "the subject level linearization leads to issues of calculation with the multinomial distribution"?  Does binomial data also have this issue, but just to a manageable extent?

 

Warm regards,

Michael

 

SteveDenham
Jade | Level 19

Hi Michael,

Retrospective vs. prospective: Can you share the model you are using, or at least the RANDOM statement?  That would give us something regarding the number of replications per level of the random variable(s), and might move you to a different quadrant of their table.  However, I think this is a case for staying with quadrature.  The model you are using is very likely to be correctly specified: some fixed factors you are interested in, some others that are observed and might be useful for subsetting, and some random effects, with the dependent variable being multinomial.  Hard to misspecify the process, by my account.

 

Linearization of binomial/Bernoulli vs. multinomial: I am just relatively unhappy with this for anything beyond a moderate-sized (~1,000 records) dataset.  I doubt it occurs much for binomial responses unless you are talking about rare events.  I bet there is a way around the outer-iteration problem you have, say a conjugate gradient method (which doesn't do an inner and an outer iteration) or relaxing the singularity testing; see the sketch below.
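
Something along these lines might be worth trying (a sketch with placeholder names; the TECH= choice and any singularity tolerance are only suggestions and would need tuning to your data):

proc glimmix data = myData method = RSPL singChol = 1e-10;   /* example value only - adjusts the Cholesky singularity tolerance */
    nlOptions tech = conGra maxIter = 5000;                  /* conjugate gradient instead of the default dual quasi-Newton */
    class id outcome trt;
    model outcome = trt / distribution = multinomial link = gLogit solution oddsRatio;
    random intercept / subject = id group = outcome;
run;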

 

The site is being pretty wonky today, so I hope this gets posted.

 

SteveDenham

Kastchei
Pyrite | Level 9

Sure!  Here's the code.

 

proc glimmix data = cogFunc._11_Model method = RMPL gradient itDetails nObsDetail cholesky initGLM initIter = 1000;
    nlOptions maxIter = 5000;

    class outcomeClassN (ref = '0') exposureN CCEN genderN raceNotHispN smokeN educationN occupationN languageN employmentN;
    model outcomeClassN = exposureN yearsSince911 CCEN age_Visit genderN raceNotHispN smokeN educationN occupationN languageN employmentN /
        distribution = multinomial link = gLogit solution cl intercept oddsRatio ddfm = residual;
    random intercept / subject = ID group = outcomeClassN;

    * Means and pairwise estimates. *;
    estimate '3 = High vs. 1 = Low'           exposureN -1  0  1,
             '2 = Intermediate vs. 1 = Low'   exposureN -1  1  0,
             '3 = High vs. 2 = Intermediate'  exposureN  0 -1  1 /
             byCategory cl exp iLink e adjust = simulate(seed = 68176 acc = 0.0001) stepdown(type = logical);
run; quit;

And here's the beginning of the output.

 

The GLIMMIX Procedure
Model Information
Data Set COGFUNC._11_MODEL
Response Variable outcomeClassN
Response Distribution Multinomial (nominal)
Link Function Generalized Logit
Variance Function Default
Variance Matrix Blocked By id
Estimation Technique Residual MPL
Degrees of Freedom Method Residual


Class Level Information
Class           Levels   Values
outcomeClassN   4        0 1 2 3
exposureN       3        1 2 3
CCEN            5        1 2 3 4 5
genderN         2        0 1
raceNotHispN    4        2 3 4 5
smokeN          3        0 2 4
educationN      5        1 2 3 4 5
occupationN     3        47 999 3355
languageN       2        2 999
employmentN     2        1 2


Number of Observations Read 58575
Number of Observations Used 58575
Missing Response 0
Missing Fixed Effects 0
Missing Random Effects 0
Missing Subject Effects 0
Missing Group Effects 0
Negative Response 0


Response Profile
Ordered Value   outcomeClassN   Total Frequency
1               0               23343
2               1               23846
3               2                7283
4               3                4103
In modeling category probabilities, outcomeClassN='0' serves as the reference category.


Dimensions
G-side Cov. Parameters 4
Columns in X 96
Columns in Z per Subject 4
Subjects (Blocks in V) 16380
Max Obs per Subject 9


Optimization Information
Optimization Technique Dual Quasi-Newton
Parameters in Optimization 4
Equality Constraints 1
Lower Boundaries 4
Upper Boundaries 1
Fixed Effects Profiled
Starting From GLM estimates
SteveDenham
Jade | Level 19

This looks good.  The thing to remember is that because the linearization is around the mean, the estimates (lsmeans, ORs, etc.) are less likely to be out toward the extremes.  My brain is a bit sluggish this morning and I don't recall the details, but a visit to the documentation makes it clear that RMPL expands the linearization around the expected value of the random effects (gamma = 0).  With only one random effect in your model, I doubt that the marginal estimates obtained are badly biased in any way.

 

SteveDenham

Kastchei
Pyrite | Level 9

Thank you, Steve, for all your help!  I've been running the models using quad over the weekend, so luckily they are almost done.

 

Warm regards,

Michael
