BookmarkSubscribeRSS Feed
StacyClark
Calcite | Level 5

I am testing if tree density differs in response to a treatment using Proc Glimmix with a negative binomial distribution. I would like to use a covariate of tree density prior to treatment in the model. Are there any diagnostics I should be using to test if my covariate should be transformed?

9 REPLIES 9
Damien_Mather
Lapis Lazuli | Level 10

The generalised linear mixed model framework of proc glimmix fundamentally encompasses covariates that would not meet otherwise necessary gaussian distributional assumptions in an ordinary least squares model framework.

 

This is not the same however as asserting that a transformation on a covariate will not improve overall model fit.

 

If this interests you, you could still usefully explore covariate transformations in the spirit of the latter, and using the proximity of the scale factor (chi sq/df) to unity, and/or with method=laplace you could compare the AICCs or -2LLs of models with competing covatiate transformations and/or also higher order polynomial terms to see if there is a minimisation on one or more particular transformations.

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Stepping back a bit: Is "tree density" a count (i.e., does "tree density" take integer values)? Typically, density is a rate: count/area. Is area the same for all experimental/observational units? I'm pondering here whether the negative binomial distribution is your best choice. 

 

If tree density is a count (and now for clarity, I'll refer to "tree density" as "tree count"), then a negative binomial distribution might be appropriate. The canonical link for the negative binomial distribution is the log. Unless you invoke a curvilinear (e.g., quadratic) model, the relationship between the response and the covariate is assumed to be linear in the GLIMMIX procedure (generalized linear mixed model), which roughly translates into a linear relationship between log(post-tree-count) and the covariate (noting of course, that a log transformation of post-tree-count is not the same as a log link). Logically, I then would consider a log transformation of pre-tree-count. Plot log(post-tree-count) against log(pre-tree-count) and visually assess the relationship. Does it look linear? You can color- or symbol-code points by other predictor variable values as well for (potentially) additional insight.

 

StacyClark
Calcite | Level 5

Thanks for your responses. I will try various transformations of the covariate to see if it improves overall model fit and to assess the visual relationship between the covariate and the y variable in the  model.

I used the NB distribution because the Poisson distribution had major overdisperssion problems. I had also tried log transforming the y variable and using a gaussian model, but that did not fix the normality problems.

StacyClark
Calcite | Level 5

The y variable was density (trees per hectare), taken on fixed radius plots of the same size. Data were collected as counts, then transformed to density using the plot size.

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Theoretically, density data cannot follow a negative binomial distribution because density data do not take integer values.

 

Unlike (at least some) packages in R, the GLIMMIX procedure accepts non-integer data in conjunction with the negative binomial distribution and appears to deliver valid results. 

 

That said, my preference for modeling density based on counts is to use an offset (which is the third model in the code below). This approach is particularly advantageous if, say, plot areas are variable.

 

dpost: post-treatment density

dpre: pre-treatment density

npost: post-treatment count

npre: pre-treatment count

_log: log transformation

 

title1 "Using density with negative binomial distribution";
proc glimmix data=have method=laplace;
    nloptions tech=nrridg;
    class trt;
    model dpost = dpre_log trt dpre_log*trt / dist=negbin;
    lsmeans trt / ilink;
    output out=glmm_out pred=pred student=student;
    run;
proc sgplot data=glmm_out;
    loess x=dpre_log y=student;
    refline 0 / axis=y;
    run;


title1 "Using count with negative binomial distribution";
proc glimmix data=have method=laplace;
    nloptions tech=nrridg;
    class trt;
    model npost = npre_log trt npre_log*trt / dist=negbin;
    lsmeans trt / ilink;
    output out=glmm_out pred=pred student=student;
    run;
proc sgplot data=glmm_out;
    loess x=npre_log y=student;
    refline 0 / axis=y;
    run;


title1 "Using count with negative binomial distribution with offset to obtain density";
proc glimmix data=have method=laplace;
    nloptions tech=nrridg;
    class trt;
    model npost = npre_log trt npre_log*trt / dist=negbin offset=area_log;
    lsmeans trt / ilink;
    output out=glmm_out pred=pred student=student;
    run;
proc sgplot data=glmm_out;
    loess x=npre_log y=student;
    refline 0 / axis=y;
    run;

 

 

StacyClark
Calcite | Level 5

Thanks for your help with this. The plots were all the same size, so density and counts should yield similar results.

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Stacy, you might find this exchange interesting. @PGStats built a nice example.

 

 

https://communities.sas.com/t5/SAS-Statistical-Procedures/Should-I-use-quot-Offset-quot-or-add-a-quo...

 

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

I'll ask, because it's a not-uncommon misconception: Are you checking normality using the observed data combined over treatment groups or using the residuals from a model? The former approach is bad, the latter approach is good. Distributional assumptions are conditional on the predictor variables.

 

Something of an aside: If the number of trees is "big enough", the Poisson distribution can be approximated by a normal distribution. Even if the data within each treatment group are normally distributed, the treatment groups could have different variances. Unequal variances might be accommodated with a log transformation or a heterogeneous variances model, as well as other approaches. So a model invoking a normal distribution might work, or it might not, depending on the data. If the counts are substantially overdispersed, then a model based on normal distributions might be less likely to be satisfactory.

StacyClark
Calcite | Level 5

My understanding is that normality should be checked with residuals from the model.

I had initially tried a Gaussian distribution, but normality was a major problem that was not overcome with transformation. A Poisson distribution had major overdispersion problems.

 

Thank you.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 3009 views
  • 1 like
  • 3 in conversation