I am currently analysing a data set from Illumina sequencing. The data set contains Operational Taxonomic Units (OTUs), i.e. counts of each bacterial sequence found in each sample. I converted the counts into percentages because the relative abundance of each species, expressed as a percentage of the total number of counts, is easier to use when explaining the biological meaning of bacterial relative abundances changing over time. My problem is that the total number of counts (the library size) differs between samples.
The objective of my study is to compare the bacterial relative abundances between treatments and to determine whether time point, treatment, or the treatment*time point interaction is significant.
I want to use the library size as a covariate because the library sizes differ and, as a result, so will the variances. I also want to fit a model that allows the variance to exceed the mean, because the bacterial relative abundances are overdispersed and I have many zeroes.
I used PROC GLIMMIX to assess the effect of time and treatment on the relative abundance of each bacterium in my community, and I wanted to use the OFFSET= option to account for the differences in library size. I used the following model, putting the time point in the RANDOM statement because I have repeated measures and, in PROC GLIMMIX, the role of the REPEATED statement of PROC MIXED is played by the RANDOM statement with the RESIDUAL option:
PROC GLIMMIX DATA=experiment;
CLASS sample timepoint treatment;
MODEL Bacteria1= treatment timepoint treatment|timepoint/link=log s dist=negbin DDFM=SATTERTH;
RANDOM timepoint/ subject=sample type=vc residual;
LSMEANS timepoint|treatment;
RUN;
In the model above, however, I am not adjusting for the differences in library size. I want to fit a model that allows the variance to exceed the mean, because my counts per sample are overdispersed and I have many zeroes. I then tried the OFFSET= option, but the optimisation did not converge:
PROC GLIMMIX DATA=experiment;
CLASS sample timepoint treatment;
MODEL Bacteria1= timepoint treatment timepoint|treatment/link=log s dist=negbin DDFM=SATTERTH OFFSET=library;
RANDOM timepoint/ subject=sample type=vc residual;
LSMEANS timepoint|treatment;
RUN;
The purpose of my study is to compare the mean relative abundances of bacterial species. If I transform my data, I no longer have means on the original scale and cannot compare them among treatments. For that reason I tried to use a generalised linear mixed model in SAS, and I would like to obtain the comparison of the mean relative abundances among treatments.
At this point I am no longer sure whether the model I am using is correct, so I would appreciate any help from the community.
Best wishes,
Emma
Hi Emma,
Before we get into zero-inflated models (which will probably require a procedure other than GLIMMIX), let's see if we can get some preliminary results from GLIMMIX. I would change your code to the following:
PROC GLIMMIX DATA=experiment;
CLASS sample timepoint treatment;
MODEL Bacteria1= timepoint|treatment/ s dist=negbin DDFM=SATTERTH OFFSET=loglibrary;
random timepoint/ subject=sample type=ar(1) residual;
LSMEANS timepoint|treatment;
RUN;
Note that the offset variable is loglibrary, so that it enters the model as a 'denominator'. You should create this variable in the data set 'experiment'. I also changed the covariance type to ar(1) to model correlation over time; type=vc assumes that the time points are completely independent, and so might not be the best choice.
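For example, a minimal data step along these lines would create it (I am assuming here that the total count per sample is stored in a variable named library; adjust the names if yours differ):

DATA experiment;
SET experiment;
loglibrary = log(library);   /* natural log of the library size, to be used as the offset */
RUN;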
I suspect that this will take more than the default iterations to converge and might also require a different optimization technique. If that is the case, insert the following line:
NLOPTIONS maxiter=500 tech=nrridg;
I like the ridge-stabilized Newton-Raphson technique (tech=nrridg) for overdispersed data. See how this works, and watch the log and the output for warnings and notes.
Steve Denham
Dear Steve,
Thank you very much for your kind answer. I tried the model you suggested, but I had to remove the maxiter option because with it the model did not converge. When I used the following syntax:
PROC GLIMMIX DATA=new;
CLASS sample tpt trt;
MODEL Fn= trt tpt tpt*trt/ link=log s dist=negbin DDFM=SATTERTH OFFSET=loglib;
random tpt/ subject=sample type=ar(1) residual;
LSMEANS tpt*trt/pdiff;
NLOPTIONS tech=nrridg;
RUN;
the model converged and I obtained the following message: Convergence criterion (PCONV=1.11022E-8) satisfied
I could also fit the ante(1) covariance structure, so I was wondering which one would be correct to use. Finally, I was wondering whether it is correct to use the log of a variable as the offset... Once again, I very much appreciate your time and consideration!
My opinion: offset variables are rarely used with distributions whose canonical link is something other than the log. So, to get everything on the same page, so to speak, the offset variable should be entered as the log of the observed total. The reason is the log link: with the offset set to log(library), the model is log(E[count]) = log(library) + linear predictor, which is equivalent to modelling log(E[count]/library), i.e. the expected rate per sequence.
As far as selecting the best covariance structure, you start to run into some problems. For distributions like the Poisson and negative binomial, the mean and variance are functionally related. Consequently, when you use the pseudo-likelihood based methods, your pseudo-data is not identical under different covariance structures. This means that the usual method of comparing information criteria, such as corrected AIC, isn't available for the marginal approach that is currently being modeled. You pretty much have to depend on your knowledge of the processes involved to come up with the most defensible covariance structure. GLIMMIX does not even output the various IC's under this approach.
However, if you move to a conditional approach, you fit an approximation to the true likelihood rather than a pseudo-likelihood, and information criteria can be used to select the covariance structure that best fits your data. If you want to do this, try the following:
PROC GLIMMIX DATA=new method=laplace empirical;
CLASS sample tpt trt;
MODEL Fn= trt tpt tpt*trt/ link=log s dist=negbin DDFM=BW OFFSET=loglib;
random tpt/ subject=sample type=ar(1);
LSMEANS tpt*trt/pdiff;
NLOPTIONS tech=nrridg;
RUN;
A Laplace method is used to approximate the marginal likelihood (first line), and the residual option is removed from the RANDOM statement so that the repeated-measures structure is fit as a G-side structure. Because METHOD=LAPLACE is used, DDFM=SATTERTH is no longer available, so ddfm=bw is an appropriate choice for a repeated measures design. Last but not least, I added the empirical option to the GLIMMIX statement to obtain empirically corrected ('sandwich') standard error estimates to compensate for the lack of the Satterthwaite adjustment.
Try it and see how the convergence goes.
Steve Denham
(PS - I am really surprised that you needed to remove the maxiter= option to get convergence.)
Hello Steve,
I have tried the suggested syntax and I obtained the following notes:
NOTE: The EMPIRICAL option for METHOD=LAPLACE adjusts the covariance matrix of the fixed effects
and the covariance parameters. It also affects the prediction covariance matrix for the
random effects solutions.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: Estimated G matrix is not positive definite.
NOTE: PROCEDURE GLIMMIX used (Total process time):
real time 1.65 seconds
cpu time 1.34 seconds
I have attached the output for further consideration. Once again, thank you for your time!
This actually looks very good. The G matrix being NPD (not positive definite) is only mildly disturbing, as the zero value for variance still has a standard error. However, looking at all of the means on the log scale, it may be worth rescaling your variables. That could easily be done by rescaling the offset variable (prior to taking the log). Do something that logically makes it smaller. For example, if it were in milliseconds, rescaling to seconds or even minutes would help. Since I don't know what a natural measure for library size might be, I can't really make a strong suggestion.
Also, add the ilink option to your lsmeans statement:
lsmeans tpt*trt/diff ilink;
so that you can get lsmeans on the original scale.
Steve Denham
Hello Steve,
Thank you again for the suggestion. I am actually not sure whether I can rescale the library size, because this number represents the total number of bacterial sequences found in a particular sample (as reported by the sequencing).
My original table with the raw counts looked like this:
Sample | Tpt | Trt | An | Avisc | Aa | Fn | Pg | Pi | Sgord | Smit | Smut | Soralis | Ssal | Ssang | Ssob | Vp | Library |
T1AG10 | 1 | AGH | 3 | 4 | 33 | 72 | 43 | 2 | 308 | 3 | 14 | 593 | 8 | 12 | 10 | 2430 | 3535 |
Sample identifies each collected observation over time; Tpt indicates the time point and Trt indicates the treatment.
The following 14 columns contain the raw counts for each of the 14 bacterial species. These numbers indicate that sequences from species An were counted 3 times, sequences from species Aa were counted 33 times, sequences from species Pg were counted 43 times, and so on. The Library column gives the total number of sequences found in this sample, here 3535.
These raw counts were used to calculate percentages of the total library size, as shown in the following table.
Sample | Tpt | Trt | An | Avisc | Aa | Fn | Pg | Pi | Sgord | Smit | Smut | Soralis | Ssal | Ssang | Ssob | Vp | Library |
T1AG10 | 1 | AGH | 0.0849 | 0.1132 | 0.9335 | 2.0368 | 1.2164 | 0.0566 | 8.7129 | 0.0849 | 0.396 | 16.7751 | 0.2263 | 0.3395 | 0.2829 | 68.7412 | 3535 |
Each value is expressed relative to the library size, because the library size is the total number of sequences counted in that particular sample.
So, for instance, the 0.0849 in the An column was obtained as:
An in T1AG10 = (raw count*100)/library size = (3*100)/3535 = 0.0849
I preferred to convert raw counts into percentages because it is easier to talk about the presence of a given percentage of a bacterial genus rather than saying "sequences from species X were found Y times in treatment Z".
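In SAS, this kind of conversion can be done with a data step roughly like the following (the data set names rawcounts and percent are only placeholders for illustration; the species columns are overwritten with their percentage values):

DATA percent;
SET rawcounts;
ARRAY species{14} An Avisc Aa Fn Pg Pi Sgord Smit Smut Soralis Ssal Ssang Ssob Vp;
DO i = 1 TO 14;
   species{i} = species{i} * 100 / Library;   /* e.g. An: 3*100/3535 = 0.0849 */
END;
DROP i;
RUN;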
Given this, I am not sure whether there is anything I can do in relation to rescaling the above library sizes.
If rescaling is not an issue (I'm hoping not), I think this approach looks great for analysing at least several of my data sets with similar characteristics.
Just to see whether the transformation helps the stability, such that the variance component does not go to zero, try a run where instead of the library being 3535 sequences, it is 3.535 "kilosequences". When all is said and done, the incidences can be put back on a per-sequence basis using the values from the ilink option.
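Something along these lines would do it (a rough sketch, assuming the total count is in a variable named library in your data set new and that loglib is the offset variable used in the model):

DATA new;
SET new;
libraryk = library / 1000;     /* 3535 sequences become 3.535 "kilosequences" */
loglib   = log(libraryk);      /* offset on the log scale */
RUN;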
Steve Denham
Dear Steve,
Thank you for your kind advice. I am not sure that I can express it as kilosequences, as the total number of sequences is count data. Is it possible to transform these total counts into something else to test another run? Once again, thank you for your time.
Emma
Hello Steve,
I am coming back to this thread because I still haven't got it right. I ran the model after taking the ln(x) transformation of the total number of counts (the library size). Now I understand what you meant by rescaling. However, I have realised that the values I obtain are extremely small. My original data were given as percentages of the total number of counts. You suggested rescaling based on the units in which the library size is expressed, but there are no units; it is just a number of counts. Is there any other way to express the means? Now that I see the results, the values are no longer representative (they are too small to be interpreted as a percentage of the total).
Again, thank you for your time!
Emma