Solved: Fit Statistics Missing when negative binomial distribution is used for...

aroebuck · Posted 11-29-2017 11:53 PM

I'm currently using PROC GLIMMIX in SAS 9.4 trying to compare gene copy numbers from groundwater (Non-Gaussian). For the most part, using a lognormal distribution worked, but in a few cases this analysis showed no significant differences when I really believe there should be. I tried a negative binomial distribution (coding below) to see what changes and while the differences make more sense, the fit statistics (AIC, AICC, etc.) are missing from the output when I looked to compare distributions. I'm just wondering if this is because the model is not appropriate or if the fit statistics obtained are different depending on the distribution. My Generalized Chi-Square and Gen.Chi Square/DF are also coming out as zero which I don't believe is expected (Image attached).

Does anyone know what I should look for or what I can try to make sure this analysis is working correctly?

Thanks!

title2 'cDNA: Differences Between Dates';
proc glimmix data=first;
class well date;
model cDNA = well date date*well / ddfm=kr dist=negbin link=log;
random _residual_ / subject =well;
lsmeans date*well / tdiff adjust=tukey lines; 
output out=second predicted=pred residual=resid residual (noblup)=mresid student=studentresid student(noblup)=smresid;
run;

sld · Posted 11-30-2017 12:04 PM

Not the easiest statistical model to start with, but some times you just have to dive into the deep end. In addition to the Stroup text I linked to in my previous message, I highly recommend

https://www.sas.com/store/prodBK_59882_en.html

which will (sometime) soon have the equivalent of a 3rd ed:

SAS® for Mixed Models: An Introduction
By Walter W. Stroup, Ph.D., George A. Milliken, Ph.D., Elizabeth A. Claassen and Russell D. Wolfinger, Ph.D.
Anticipated publication date: First quarter 2018

What is "block" and how does it fit into your study design?

Are you specifically interested in just these 3 wells? (Alternatively, you could think of these 3 wells as a random sample from a statistical population of wells to which you would like to make inference. Of course, that would be a pretty small sample.)

Transforming a count to a concentration by dividing each count by a constant does not change the information contained in the cDNA variable--it just rescales it, and all other things equal, multiplying/dividing by a constant has no effect on the results of statistical tests. So although the rescaled values are no longer integers, at its heart, the variable is still a count. That doesn't mean that you necessarily have to use a discrete distribution. Distributions for counts (e.g., Poisson and negative binomial) are particularly useful when counts are small; and for those distributions, the variance implicitly increases as the mean increases. Your counts are not obviously small (some are huge), and surely variance increases with the mean; I could see Poisson or NB as possibilities, or lognormal, and the choice would depend upon the data characteristics. You apparently have only 24 observations, so the choice may never be obvious.

If you have not done so already, plot cDNA against date for each well for a visual assessment of the effect of date and how that effect may change over wells. You probably have to work block into this plotting, but I don't know yet what block is, so I can't say how you might need to do that.

View solution in original post

sld · Posted 11-30-2017 12:52 AM

Check the Model Information table to see what estimation method was used. A model with a random statement and a non-normal distribution will use a pseudo-likelihood method by default

http://documentation.sas.com/?docsetId=statug&docsetTarget=statug_glimmix_details66.htm&docsetVersio...

and by default IC statistics are not computed for pseudo-likelihood

http://documentation.sas.com/?docsetId=statug&docsetTarget=statug_glimmix_details76.htm&docsetVersio...

Ideally, the distribution of choice would be logically consistent with the inherent nature of the response. The lognormal distribution is appropriate for a response measured on a continuous-scale. The negative binomial response is appropriate for a count response. What is cDNA, what is its measurement scale, and what range of values does it take? "gene copy numbers" suggests a count.

Is "date" measured repeatedly on the same "well"? How many dates are there? We would benefit from a more detailed description of your study design.

I do not like the looks of your RANDOM statement. Generally, you would not use "random _residual_" for a negative binomial distribution because the negative binomial distribution already and implicitly estimates a scale parameter. To be honest, I have no idea what the impact of " / subject=well" would be. I'm guessing not good, but I could be overlooking something. From where did you get this syntax idea? (If it came from an example assuming a normal distribution, then it is not applicable to a negative binomial distribution.) If date is a repeated measures factor, then you can use a GLMM or a GEE, but you have to choose one or the other and specify appropriate code syntax.

An excellent resource for GLMMs (and to a lesser extent GEEs) in SAS is the text by Walt Stroup

https://www.amazon.com/Generalized-Linear-Mixed-Models-Applications/dp/1439815127/ref=sr_1_1?ie=UTF8...

(It would be an excellent holiday present.) These are complicated, persnickety, and treacherous models; and attempting them casually usually is asking for trouble.

Apart from the fact that I do not like your current model...the zero values for Generalized Chi-Square and Gen.Chi Square/DF could be just rounding. Note that the Estimate for Residual in the Covariance Parameters table is very small.

aroebuck · Posted 11-30-2017 10:54 AM

Thanks very much for the prompt reply! I have now managed to get the Fit Statistics. cDNA is copies of genes in a sample. The copies, initially would be a count, but I express them as copies/100mL groundwater. The values have a huge range (they can be anywhere between around 10 copies to billions of copies). "Date" is actually meant to represent different seasons and thus I included it as a fixed effect. The same wells are sampled each date, but for testing to see if seasons have an effect on gene copies. "Well" is also a treatment since the wells have varying degrees of contamination. I see what you mean about using /subject=well now. I had included it because the well is also the unit I am measuring. Now that I think of it, would adding corresponding block values to the wells and using "/subject=block" be a better idea?

class well date block;
model cDNA = well date date*well / dist=negbin link=log;
random intercept / subject =block;
lsmeans date*well / tdiff adjust=tukey lines;  
output out=second predicted=pred residual=resid residual (noblup)=mresid student=studentresid student(noblup)=smresid;
run;

And actually, since it seems like this data is continuous, is there any way I can adjust the lognormal distribution method? (I've attached the output from the lognormal distribution analysis and the coding is below). I'm sorry to be a pest, I'm a beginner with SAS and just running stats in general. Wrapping my head around it is proving tricky.

Thanks again,

A

title2 'cDNA';
proc glimmix data=first;
class well date block;
model cDNA = well date date*well / dist=lognormal ddfm=kr;
random _residual_ / subject = block;
lsmeans date*well / tdiff adjust=tukey lines;  
output out=second predicted=pred residual=resid residual (noblup)=mresid student=studentresid student(noblup)=smresid;
run;

sld · Posted 11-30-2017 12:04 PM

Not the easiest statistical model to start with, but some times you just have to dive into the deep end. In addition to the Stroup text I linked to in my previous message, I highly recommend

https://www.sas.com/store/prodBK_59882_en.html

which will (sometime) soon have the equivalent of a 3rd ed:

SAS® for Mixed Models: An Introduction
By Walter W. Stroup, Ph.D., George A. Milliken, Ph.D., Elizabeth A. Claassen and Russell D. Wolfinger, Ph.D.
Anticipated publication date: First quarter 2018

What is "block" and how does it fit into your study design?

Are you specifically interested in just these 3 wells? (Alternatively, you could think of these 3 wells as a random sample from a statistical population of wells to which you would like to make inference. Of course, that would be a pretty small sample.)

Transforming a count to a concentration by dividing each count by a constant does not change the information contained in the cDNA variable--it just rescales it, and all other things equal, multiplying/dividing by a constant has no effect on the results of statistical tests. So although the rescaled values are no longer integers, at its heart, the variable is still a count. That doesn't mean that you necessarily have to use a discrete distribution. Distributions for counts (e.g., Poisson and negative binomial) are particularly useful when counts are small; and for those distributions, the variance implicitly increases as the mean increases. Your counts are not obviously small (some are huge), and surely variance increases with the mean; I could see Poisson or NB as possibilities, or lognormal, and the choice would depend upon the data characteristics. You apparently have only 24 observations, so the choice may never be obvious.

If you have not done so already, plot cDNA against date for each well for a visual assessment of the effect of date and how that effect may change over wells. You probably have to work block into this plotting, but I don't know yet what block is, so I can't say how you might need to do that.

aroebuck · Posted 11-30-2017 12:25 PM

Thanks for the resources! I guess my reasoning there was that "block" was the experimental unit that test factors (date and well, aka contamination severity) are being applied to (essentially "block" would be the well being sampled which would make "well" a proxy for contamination level since each well had varying concentrations of contaminant). I was thinking maybe the two need to be separated, but I tried running both "block" and "well" as the experimental unit and it doesn't seem to make much difference. Thanks for the input, I'm glad I'm not going too crazy with trying these distributions. It is just the three wells that I am considering for this particular analysis.

I do have plots of copy number vs. date for each well so I've been able to look at very general trends, but the analysis was stumping me a little.

Thank you again, you've been very helpful.

A

sld · Posted 11-30-2017 12:32 PM

So, "block" is the same thing as "well", just a different name?

You have 3 wells and 4 dates, but 24 observations. What accounts for 24 observations on 12 treatment combinations?

If you incorporate date as a categorical factor, you need a design (experimental/sampling) unit that serves as a replicate for each level of well. It's not obvious what that is so I doubt you have a correct model yet.

aroebuck · Posted 11-30-2017 01:03 PM

I suppose "block" would refer to the well itself as a sampling unit, while "well" would represent the contamination level. Sorry, I was initially treating them as one and the same (hence subject=well). I was thinking since the wells are intrinsically linked to the contamination level it would be the same thing.

I was testing it to see if adding "block" made a difference in the result since you had mentioned you didn't know what effect using "subject=well" would have. I have duplicate samples from each well (hence there are 24 observations instead of 12). That was why I was trying to add the "block" variable. I am sampling from the wells thus the wells themselves would be the experimental unit (which I called "block"). Each well is associated with one of the levels of contamination ("well") and sampled in four different seasons ("date") and were sampled in duplicate in each sampling event. Is that what you were thinking I needed, or am I misunderstanding?

A

sld · Posted 11-30-2017 04:54 PM

I'd say your experimental design thinking is a bit murky.

It is important to clearly distinguish between an experimental fixed effects factor and the experimental units (i.e., random effects factor) to which the levels of the fixed effects factor are randomly assigned.

As an example, let us say that you have 3 types of well contamination; this is the experimental fixed effects factor with 3 levels. To replicate the effect of contamination type, you "randomly" assign each level of contamination to 4 individual wells; wells are replicates == experimental units == random effects factor with 12 levels (because you have 3 x 4 = 12 wells). Obviously, a factor with 3 levels cannot be identical to a factor with 12 levels. Plus one is a fixed effects factor, and the other is a random effects factor.

In your study, you could think of block as the experimental unit for the well factor (with 3 levels), but then you have no replication and you can do no statistical test. The two factors (block and well) are identical and cannot both be included in one statistical model.

If you are thinking of well as a fixed effects factor and you want a statistical test of whether the 3 well means are equal, then the only thing you can do is to use the duplicate samples as replicates. The duplicate samples need to be independent draws of water from the well to function even minimally as replicates, so that the draws represent a random sample of all possible draws you could have made from that well on that date.

If that assumption seems untenable, you could consider using methodology for unreplicated experiments. See Chapter 5 in this text by Milliken and Johnson:

https://www.amazon.com/Analysis-Messy-Data-Nonreplicated-Experiments/dp/0412063719

where their sausage type is your well, and their judge is your date. This method is not a perfect solution, as discussed in the book, but might be preferable to assuming that your duplicate samples function as true replicates.

aroebuck · Posted 11-30-2017 05:29 PM

I think I see what you mean. The intention I had was to use the duplicate water samples as replicates as you mentioned and compare means between the wells and between seasons (dates). The data is from an in-situ site, and was very restrictive for a nice experimental design, sadly. Since you said that I can't include block and well in the same model because they are technically the same thing, was I right to use well as the experimental unit like I did initially?

Thanks,

A

sld · Posted 11-30-2017 07:47 PM

Using duplicate samples as replicates, with well and date as categorical fixed effects factors, the statistical model is a two-way factorial (well x date) in a completely randomized design. Try something like

proc glimmix data=first
  plots=(studentpanel boxplot(fixed));
  class well date;
  model cDNA = well date well*date / dist=lognormal;
  lsmeans well date / diff adjust=tukey lines;
  lsmeans well*date / plot=meanplot(sliceby=well join cl) slice=(well date) 
    slicediff=(well date) adjust=tukey; 
  run;

and see how the residual plots look with respect to the assumption of lognormal distribution. You could also try a Poisson or negative binomial using data on the count scale.

aroebuck · Posted 11-30-2017 09:14 PM

I gave the code you sent a try (residual plots attached). I tried it with the negative binomial and poisson distributions as well to see how they compared and they didn't improve much. The Poisson was much worse and the negative binomial didn't change much, but the fit statistics generally were much lower. I'm questioning whether the residuals are acceptable to be using this method. There seems to be a lot of variation. Also, I read that with factorial analyses, if the interaction between two factors is significant, then I cannot compare the levels of one factor with the levels of the other factor averaged out, which I think is in some of the output with that analysis. The interaction is significant here, so am I correct in saying that I would be looking at the comparisons of the well*date lsmeans only in this analysis to search for significant differences?

Thanks for your patience.

A

sld · Posted 11-30-2017 11:13 PM

I concur, the residuals look awful.

With the range of values taken by cDNA, the distributional characteristics probably are going to be a challenge. Try

proc glimmix data=first  plots=(studentpanel boxplot(fixed));
  class well date;
  model cDNA = well date well*date / dist=lognormal;  
  lsmeans well date / diff adjust=tukey lines;
  lsmeans well*date / plot=meanplot(sliceby=well join cl) slice=(well date) slicediff=(well date) adjust=tukey;
  random _residual_ / group=date; 
  run;

The RANDOM will fit a separate variance for each date.

Please post the entire output.

Generally (with some exceptions), if the interaction is significant, then main effects are nonsensical. Interaction implies that the story for main effect A is not the same at all levels of main effect B; a main effect for factor A applies the same story to all levels of factor B. So depending on the nature of the interaction, you might be looking at only the interaction. But first, you need a model that adequately meets assumptions.

This is the nature of analysis of real data 🙂

aroebuck · Posted 12-01-2017 09:25 AM

Okay. I tried the code with the additional line and the entire output is attached. I'm not sure how much better my residuals look with the separate variance for date. One other question I had I was wondering why a CRD is appropriate for this observational study? Just because the study is in situ, the wells aren't exactly randomly assigned with the levels on contaminant per se. Or do we just assume randomness because it would be the closest real experimental set up to what I have?

Yes, this data is proving to be very stubborn.

A

sld · Posted 12-01-2017 01:53 PM

The lsmean for well 29 in N16 is oddly low, and I am suspicious about the validity of those data. In fact, I'm a bit suspicious of the rest of the N16 data. What do you know about that?

It is a CRD because each duplicate sample is "randomly assigned" to a level of well and a level of date. Obviously, we are taking liberties with "randomly assigned", and you have to decide whether treating this study as a quasi-experiment is appropriate for your purposes.

aroebuck · Posted 12-01-2017 02:02 PM

The gene I was looking for was non-detectable in that well in N16. I was told to substitute a very low number instead of using zero so that I could still run statistical analysis on this. Often with gene copy data non-detects and even outliers are still valuable information biologically speaking, so I was instructed to keep them in my analysis.

I'm not sure why there should be any other issues with the N16 data. Are you referring to the look of the box plot?

Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

SAS Innovate 2025: Call for Content