topic Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data. in Statistical Procedures

Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Thu, 30 Nov 2017 04:53:05 GMT

I'm currently using PROC GLIMMIX in SAS 9.4 trying to compare gene copy numbers from groundwater (Non-Gaussian). For the most part, using a lognormal distribution worked, but in a few cases this analysis showed no significant differences when I really believe there should be. I tried a negative binomial distribution (coding below) to see what changes and while the differences make more sense, the fit statistics (AIC, AICC, etc.) are missing from the output when I looked to compare distributions. I'm just wondering if this is because the model is not appropriate or if the fit statistics obtained are different depending on the distribution. My Generalized Chi-Square and Gen.Chi Square/DF are also coming out as zero which I don't believe is expected (Image attached).

Does anyone know what I should look for or what I can try to make sure this analysis is working correctly?

Thanks!

title2 'cDNA: Differences Between Dates';
proc glimmix data=first;
class well date;
model cDNA = well date date*well / ddfm=kr dist=negbin link=log;
random _residual_ / subject =well;
lsmeans date*well / tdiff adjust=tukey lines; 
output out=second predicted=pred residual=resid residual (noblup)=mresid student=studentresid student(noblup)=smresid;
run;

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Thu, 30 Nov 2017 05:52:33 GMT

Check the Model Information table to see what estimation method was used. A model with a random statement and a non-normal distribution will use a pseudo-likelihood method by default

http://documentation.sas.com/?docsetId=statug&docsetTarget=statug_glimmix_details66.htm&docsetVersion=14.3&locale=en

and by default IC statistics are not computed for pseudo-likelihood

http://documentation.sas.com/?docsetId=statug&docsetTarget=statug_glimmix_details76.htm&docsetVersion=14.3&locale=en

Ideally, the distribution of choice would be logically consistent with the inherent nature of the response. The lognormal distribution is appropriate for a response measured on a continuous-scale. The negative binomial response is appropriate for a count response. What is cDNA, what is its measurement scale, and what range of values does it take? "gene copy numbers" suggests a count.

Is "date" measured repeatedly on the same "well"? How many dates are there? We would benefit from a more detailed description of your study design.

I do not like the looks of your RANDOM statement. Generally, you would not use "random _residual_" for a negative binomial distribution because the negative binomial distribution already and implicitly estimates a scale parameter. To be honest, I have no idea what the impact of " / subject=well" would be. I'm guessing not good, but I could be overlooking something. From where did you get this syntax idea? (If it came from an example assuming a normal distribution, then it is not applicable to a negative binomial distribution.) If date is a repeated measures factor, then you can use a GLMM or a GEE, but you have to choose one or the other and specify appropriate code syntax.

An excellent resource for GLMMs (and to a lesser extent GEEs) in SAS is the text by Walt Stroup

https://www.amazon.com/Generalized-Linear-Mixed-Models-Applications/dp/1439815127/ref=sr_1_1?ie=UTF8&qid=1512020181&sr=8-1&keywords=stroup+mixed+model

(It would be an excellent holiday present.) These are complicated, persnickety, and treacherous models; and attempting them casually usually is asking for trouble.

Apart from the fact that I do not like your current model...the zero values for Generalized Chi-Square and Gen.Chi Square/DF could be just rounding. Note that the Estimate for Residual in the Covariance Parameters table is very small.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Thu, 30 Nov 2017 15:54:51 GMT

Thanks very much for the prompt reply! I have now managed to get the Fit Statistics. cDNA is copies of genes in a sample. The copies, initially would be a count, but I express them as copies/100mL groundwater. The values have a huge range (they can be anywhere between around 10 copies to billions of copies). "Date" is actually meant to represent different seasons and thus I included it as a fixed effect. The same wells are sampled each date, but for testing to see if seasons have an effect on gene copies. "Well" is also a treatment since the wells have varying degrees of contamination. I see what you mean about using /subject=well now. I had included it because the well is also the unit I am measuring. Now that I think of it, would adding corresponding block values to the wells and using "/subject=block" be a better idea?

class well date block;
model cDNA = well date date*well / dist=negbin link=log;
random intercept / subject =block;
lsmeans date*well / tdiff adjust=tukey lines;  
output out=second predicted=pred residual=resid residual (noblup)=mresid student=studentresid student(noblup)=smresid;
run;

And actually, since it seems like this data is continuous, is there any way I can adjust the lognormal distribution method? (I've attached the output from the lognormal distribution analysis and the coding is below). I'm sorry to be a pest, I'm a beginner with SAS and just running stats in general. Wrapping my head around it is proving tricky.

Thanks again,

title2 'cDNA';
proc glimmix data=first;
class well date block;
model cDNA = well date date*well / dist=lognormal ddfm=kr;
random _residual_ / subject = block;
lsmeans date*well / tdiff adjust=tukey lines;  
output out=second predicted=pred residual=resid residual (noblup)=mresid student=studentresid student(noblup)=smresid;
run;

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Thu, 30 Nov 2017 17:04:14 GMT

Not the easiest statistical model to start with, but some times you just have to dive into the deep end. In addition to the Stroup text I linked to in my previous message, I highly recommend

https://www.sas.com/store/prodBK_59882_en.html

which will (sometime) soon have the equivalent of a 3rd ed:

SAS® for Mixed Models: An Introduction
By Walter W. Stroup, Ph.D., George A. Milliken, Ph.D., Elizabeth A. Claassen and Russell D. Wolfinger, Ph.D.
Anticipated publication date: First quarter 2018

What is "block" and how does it fit into your study design?

Are you specifically interested in just these 3 wells? (Alternatively, you could think of these 3 wells as a random sample from a statistical population of wells to which you would like to make inference. Of course, that would be a pretty small sample.)

Transforming a count to a concentration by dividing each count by a constant does not change the information contained in the cDNA variable--it just rescales it, and all other things equal, multiplying/dividing by a constant has no effect on the results of statistical tests. So although the rescaled values are no longer integers, at its heart, the variable is still a count. That doesn't mean that you necessarily have to use a discrete distribution. Distributions for counts (e.g., Poisson and negative binomial) are particularly useful when counts are small; and for those distributions, the variance implicitly increases as the mean increases. Your counts are not obviously small (some are huge), and surely variance increases with the mean; I could see Poisson or NB as possibilities, or lognormal, and the choice would depend upon the data characteristics. You apparently have only 24 observations, so the choice may never be obvious.

If you have not done so already, plot cDNA against date for each well for a visual assessment of the effect of date and how that effect may change over wells. You probably have to work block into this plotting, but I don't know yet what block is, so I can't say how you might need to do that.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Thu, 30 Nov 2017 17:25:29 GMT

Thanks for the resources! I guess my reasoning there was that "block" was the experimental unit that test factors (date and well, aka contamination severity) are being applied to (essentially "block" would be the well being sampled which would make "well" a proxy for contamination level since each well had varying concentrations of contaminant). I was thinking maybe the two need to be separated, but I tried running both "block" and "well" as the experimental unit and it doesn't seem to make much difference. Thanks for the input, I'm glad I'm not going too crazy with trying these distributions. It is just the three wells that I am considering for this particular analysis.

I do have plots of copy number vs. date for each well so I've been able to look at very general trends, but the analysis was stumping me a little.

Thank you again, you've been very helpful.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Thu, 30 Nov 2017 17:32:37 GMT

So, "block" is the same thing as "well", just a different name?

You have 3 wells and 4 dates, but 24 observations. What accounts for 24 observations on 12 treatment combinations?

If you incorporate date as a categorical factor, you need a design (experimental/sampling) unit that serves as a replicate for each level of well. It's not obvious what that is so I doubt you have a correct model yet.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Thu, 30 Nov 2017 18:03:24 GMT

I suppose "block" would refer to the well itself as a sampling unit, while "well" would represent the contamination level. Sorry, I was initially treating them as one and the same (hence subject=well). I was thinking since the wells are intrinsically linked to the contamination level it would be the same thing.

I was testing it to see if adding "block" made a difference in the result since you had mentioned you didn't know what effect using "subject=well" would have. I have duplicate samples from each well (hence there are 24 observations instead of 12). That was why I was trying to add the "block" variable. I am sampling from the wells thus the wells themselves would be the experimental unit (which I called "block"). Each well is associated with one of the levels of contamination ("well") and sampled in four different seasons ("date") and were sampled in duplicate in each sampling event. Is that what you were thinking I needed, or am I misunderstanding?

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Thu, 30 Nov 2017 21:54:23 GMT

I'd say your experimental design thinking is a bit murky.

It is important to clearly distinguish between an experimental fixed effects factor and the experimental units (i.e., random effects factor) to which the levels of the fixed effects factor are randomly assigned.

As an example, let us say that you have 3 types of well contamination; this is the experimental fixed effects factor with 3 levels. To replicate the effect of contamination type, you "randomly" assign each level of contamination to 4 individual wells; wells are replicates == experimental units == random effects factor with 12 levels (because you have 3 x 4 = 12 wells). Obviously, a factor with 3 levels cannot be identical to a factor with 12 levels. Plus one is a fixed effects factor, and the other is a random effects factor.

In your study, you could think of block as the experimental unit for the well factor (with 3 levels), but then you have no replication and you can do no statistical test. The two factors (block and well) are identical and cannot both be included in one statistical model.

If you are thinking of well as a fixed effects factor and you want a statistical test of whether the 3 well means are equal, then the only thing you can do is to use the duplicate samples as replicates. The duplicate samples need to be independent draws of water from the well to function even minimally as replicates, so that the draws represent a random sample of all possible draws you could have made from that well on that date.

If that assumption seems untenable, you could consider using methodology for unreplicated experiments. See Chapter 5 in this text by Milliken and Johnson:

https://www.amazon.com/Analysis-Messy-Data-Nonreplicated-Experiments/dp/0412063719

where their sausage type is your well, and their judge is your date. This method is not a perfect solution, as discussed in the book, but might be preferable to assuming that your duplicate samples function as true replicates.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Thu, 30 Nov 2017 22:29:11 GMT

I think I see what you mean. The intention I had was to use the duplicate water samples as replicates as you mentioned and compare means between the wells and between seasons (dates). The data is from an in-situ site, and was very restrictive for a nice experimental design, sadly. Since you said that I can't include block and well in the same model because they are technically the same thing, was I right to use well as the experimental unit like I did initially?

Thanks,

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Fri, 01 Dec 2017 00:47:50 GMT

Using duplicate samples as replicates, with well and date as categorical fixed effects factors, the statistical model is a two-way factorial (well x date) in a completely randomized design. Try something like

proc glimmix data=first
  plots=(studentpanel boxplot(fixed));
  class well date;
  model cDNA = well date well*date / dist=lognormal;
  lsmeans well date / diff adjust=tukey lines;
  lsmeans well*date / plot=meanplot(sliceby=well join cl) slice=(well date) 
    slicediff=(well date) adjust=tukey; 
  run;

and see how the residual plots look with respect to the assumption of lognormal distribution. You could also try a Poisson or negative binomial using data on the count scale.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Fri, 01 Dec 2017 02:14:44 GMT

I gave the code you sent a try (residual plots attached). I tried it with the negative binomial and poisson distributions as well to see how they compared and they didn't improve much. The Poisson was much worse and the negative binomial didn't change much, but the fit statistics generally were much lower. I'm questioning whether the residuals are acceptable to be using this method. There seems to be a lot of variation. Also, I read that with factorial analyses, if the interaction between two factors is significant, then I cannot compare the levels of one factor with the levels of the other factor averaged out, which I think is in some of the output with that analysis. The interaction is significant here, so am I correct in saying that I would be looking at the comparisons of the well*date lsmeans only in this analysis to search for significant differences?

Thanks for your patience.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Fri, 01 Dec 2017 04:13:40 GMT

I concur, the residuals look awful.

With the range of values taken by cDNA, the distributional characteristics probably are going to be a challenge. Try

proc glimmix data=first  plots=(studentpanel boxplot(fixed));
  class well date;
  model cDNA = well date well*date / dist=lognormal;  
  lsmeans well date / diff adjust=tukey lines;
  lsmeans well*date / plot=meanplot(sliceby=well join cl) slice=(well date) slicediff=(well date) adjust=tukey;
  random _residual_ / group=date; 
  run;

The RANDOM will fit a separate variance for each date.

Please post the entire output.

Generally (with some exceptions), if the interaction is significant, then main effects are nonsensical. Interaction implies that the story for main effect A is not the same at all levels of main effect B; a main effect for factor A applies the same story to all levels of factor B. So depending on the nature of the interaction, you might be looking at only the interaction. But first, you need a model that adequately meets assumptions.

This is the nature of analysis of real data 🙂

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Fri, 01 Dec 2017 14:25:10 GMT

Okay. I tried the code with the additional line and the entire output is attached. I'm not sure how much better my residuals look with the separate variance for date. One other question I had I was wondering why a CRD is appropriate for this observational study? Just because the study is in situ, the wells aren't exactly randomly assigned with the levels on contaminant per se. Or do we just assume randomness because it would be the closest real experimental set up to what I have?

Yes, this data is proving to be very stubborn.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Fri, 01 Dec 2017 18:53:17 GMT

The lsmean for well 29 in N16 is oddly low, and I am suspicious about the validity of those data. In fact, I'm a bit suspicious of the rest of the N16 data. What do you know about that?

It is a CRD because each duplicate sample is "randomly assigned" to a level of well and a level of date. Obviously, we are taking liberties with "randomly assigned", and you have to decide whether treating this study as a quasi-experiment is appropriate for your purposes.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Fri, 01 Dec 2017 19:02:58 GMT

The gene I was looking for was non-detectable in that well in N16. I was told to substitute a very low number instead of using zero so that I could still run statistical analysis on this. Often with gene copy data non-detects and even outliers are still valuable information biologically speaking, so I was instructed to keep them in my analysis.

I'm not sure why there should be any other issues with the N16 data. Are you referring to the look of the box plot?

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Fri, 01 Dec 2017 19:32:22 GMT

Hmm. Do you think that the gene was truly not there or there at very low levels, or do you think you might have had a problem with the process by which counts are acquired? What is the detectability limit for your process? You can clearly see that how you deal with non-detectable results has a big impact on the statistical analysis. This is not a context that I am not very familiar with, so I cannot offer you much advice about an analysis approach.

I'm struck by the low variability among the duplicate samples at N16, and I wonder whether your process was running well for these samples.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Fri, 01 Dec 2017 19:40:34 GMT

I don't think there should be any issues with that well in particular (29 in N16). The samples from well 30 and 28 were sampled processed exactly the same way. To quantify genes one uses a thermocycler. I don't think there should be a problem because I randomly ran my samples in the thermocycler (you can do up to 96 at a time). Samples from well 29 were run at the same time as those from 28 & 30 for that sampling event and both of those worked. I can't think of a reason that well should be different.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Fri, 01 Dec 2017 19:46:31 GMT

Sorry I missed some of your other questions. qPCR can detect genes down to almost a single copy (I think the technical limit is 3). I filtered large water samples, then extracted DNA and quantified gene copies in the extract using the thermocycler. You only load a very small amount in the thermocycler but then back-calculate to copies/100mL groundwater.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

sld — Fri, 01 Dec 2017 20:44:46 GMT

I'll play devil's advocate: Are you even the tiniest bit skeptical that Well 29, which was similar to the other two wells at the other 3 sampling dates, should be non-detectable at N16? What biological/hydrological mechanisms would explain that? (I don't need to know, but you do.)

Are there procedures in your protocol, from sampling to inserting into the thermocycler, where something might have gone awry? Are the duplicate samples processed entirely separately, or are they two aliquots of the same sample? Did you count other genes, and do any of them show a similar pattern?

Once you've resolved the data validity issues to your satisfaction, you can return to the statistical analysis. Making the zeros be anything "small" will produce the same thing you've seen so far and will trigger significant results for well, date and interaction. If you want to see the impact of the detectibility-replacement, run the analysis without N16. The heterogeneous variances model probably is about as good as is feasible; you won't see any difference in the residual plots, but you'll note differences in SEs for different dates and fit statistics.

Re: Fit Statistics Missing when negative binomial distribution is used for Non-Gaussian Data.

aroebuck — Fri, 01 Dec 2017 21:07:52 GMT

You have made me think of something actually! My data set is composed of multiple depths (which I am not supposed to factor into this analysis). (Eg. 1 "sample" at 1 depth in a well has reps A & B which are my "duplicates" and I have 1 "sample" (duplicated) per depth (up to seven depths) for each well. However, each "sample" is not representative of the entire well; they are very different due to the geology, so these samples cannot be used a replicates for the well. Instead, what I did was take an average of Rep A for all the depths and another average of Rep B for all the depth to represent the well overall. These two values I am using as "replicates" for the analysis we have been discussing. What you said made me think that due to some restraints, were were unable to sample all of the same depths at each date. In N15 all 6 depths were sampled, but in N16 only three were sampled. You made me realize that an entire "well" average would therefore not be comparable between dates. What I should do instead is find the depths that were sampled on all four dates and only look at those. This, I think should balance things out quite a bit more. This will help with well 29 as well; two of the three depths were non-detect, while one had detection in only one of the replicates (this can happen particularly at very low copy numbers). I will give it a shot and see how it affects the outcome.