Thank you for reading my post. I am analysing metabolic data from an experiment where I performed fecal incubations in bottles. I selected 8 donors, in triplicate, and I measured these metabolites at time 0h and 16h. Two different treatments were provided for each donor, so I had 3 bottles with TRT1 and 3 bottles with TRT2, at two time points, for a total of 12 measurements per person. However,
data from two people at time 0h was lost, so my data was unbalanced in the end.
My data looks like this:
data vfas;
input ID$ Donor$ Trt$ Tpt$ acetate propionate isobutyrate butyrate isovalerate valerate isocaproate caproate total pH;
cards;
The attached file contains a subsection of my data.
Where donor=person sampled (D1-D8), trt= treatment (CX/PEG), tpt= time point (0h or 16h)
If I run a mixed model with the REPEATED statement, I get a non-est in the least square means, so I tried the following option:
proc mixed data=vfas;
class trt donor tpt;
model acetate= donor tpt trt donor*trt trt*tpt trt*tpt*donor;
Lsmeans trt*tpt trt*tpt*donor/ diff om bylevel;
run;
I know that this is incorrect because here I assumed that the measurements were independent. I used the statement BYLEVEL to avoid using the missing data to calculate the lsmeans. This model ran but I think is not correct, so I tried the following:
proc mixed data=vfas covtest;
class trt donor tpt;
model acetate= donor trt tpt trt*tpt*donor trt*tpt/ddf= 7,1,1,10,1;
repeated tpt/ subject=donor type=cs ;
lsmeans trt*tpt trt*tpt*donor/pdiff ;
run;
I specified the degrees of freedom based on what I had in the first model, for each of the effects specified in the model statement. I still have NON-EST in the treatment*time point effect, and I do not know how to avoid this. Besides, I could only use the CS structure with this syntax, because it won't run if I change it to AR(1), which I think is more correct.
I am really interesting in seeing the effect of treatment*time point, so I would appreciate if someone can indicate me whether I can use any other model (PROC GLIMMIX?) or what do I need to correct from the above codes. Thank you for your time!
Emma
First off, looking at your posting history, I see that you are working with some challenging models. I highly recommend two resources:
(1) https://www.sas.com/store/books/categories/usage-and-reference/sas-for-mixed-models-second-edition/p... (the first volume of the new edition of which will soon be available, I saw a preprint last month, so excited!), and
(2) https://www.amazon.com/Generalized-Linear-Mixed-Models-Applications/dp/1439815127
and
cultivating a statistician as a close colleague--ideally someone you can physically sit with at a table to discuss your research. These models are difficult and persnickety. It is treacherous (and naive) to assume that you can know enough to do a good job with statistics when your main interest is elsewhere (as it should be if you are not a statistician). This forum is not a substitute for real, problem-specific statistical advice.
You are missing a fundamental understanding of the distinction between fixed and random effects, which studing the references above may remedy; plus this paper http://onlinelibrary.wiley.com/doi/10.2307/1941729/abstract has a great discussion of fixed/random issues.
I am assuming below that observations at time 0h and 16h were obtained on the same bottle. If this is not a correct assumption, then everything below would need adjustment.
Assuming that I understand your design correctly, this is where I would start; it is not necessarily where I would end. If your response variables follow the normal distribution, then you could accomplish a similar model using MIXED, but GLIMMIX is my preferred procedure because it is more flexible.
proc glimmix data=have; class trt tpt donor bottle; model y = trt tpt trt*tpt; random donor donor*trt bottle(donor trt) tpt*donor*trt; run;
Note that I have added a factor for bottle identification that does not currently exist in your dataset.
Donor is your replicate. As such, it must be a random effects factor and should not be in the MODEL or LSMEANS statement. That should address the NON-EST problem, I think.
BOTTLE is a subsample. If you had balanced data, then I would recommend computing a mean for the variable over the 3 bottles, and then using the mean as the response in a simpler model which drops BOTTLE. You could still try this approach; in practice, unless the unbalance is extreme, I've found it makes little difference.
With only two repeated measurements on the same bottle, the choice of covariance structures is reduced to whether variances are the same at 0h and 16h, or not (i.e., homogenous variances). Compound symmetry CS is equivalent to AR(1); the choice is CS versus CSH.
My experience with GLIMMIX (or MIXED) is that it tends to give me the denomDF that I think are appropriate, but I always sketch out what I expect in advance. Sometimes I specify with the ddf= option, but not often. You want to be very comfortable (and knowledgeable) about the statistical model before you override the default. (But you should not trust the default, by default.)
Assuming that the response (conditional on the predictors) follows a normal distribution, you should get the same result using MIXED as GLIMMIX. The challenge lies in first defining the correct model.
Thank you for your kind answer. Yes, the observations were obtained from the same bottle. I have two measurements per replicate bottle because I wanted to see how reproducible was my inoculation of the treatment. It was basically to just to visually compare if the values of the metabolites detected were the same and that I didn't make mistakes when injecting the treatment. Thus, I didn't want to add bottle as an additional factor because I was afraid that it would complicate my design. I appreciate that you pointed out that I can use the means instead. I have only missing data at the initial time point in 2 donors out of 8, so it may not make a difference, like you mentioned. I will definitely try this approach.
Indeed donor is my replicate, that is the reason for using 8 people. My objective was to compare whether treatment 1 had and impact on the metabolites measured, in comparison with the control (trt2). I understand that it is a random effect, however I was interested in seeing the differences between each person, for each metabolite. For this reason, I added it in the model statement. I agree than then I would have a model only with fixed factors, but I cannot use GLM because it is a repeated measures design. Thus, I continued using PROC MIXED.
I was actually not comfortable changin the ddf=, because I was actually not sure, and that is the reason for asking my question in the forum. I even continued thinking yesterday and I tried the following:
proc mixed data=vfas covtest;
class trt donor tpt;
model acetate= trt tpt trt*tpt donor(trt tpt)/ddfm= kr;
repeated tpt/ subject=donor type=cs ;
lsmeans donor(trt tpt) trt*tpt /pdiff ;
run;
Which gave me what I wanted, but I still have donor as fixed effect. So, please roast me by all means, I just ordered the books you recommended, and I would like to know whether the above is correct.
It would be absolutely fantastic to have a knowledgeable colleague who could assist me in person with my questions. Unfortunately that is not the case in my Department, and/or Faculty. I even tried with the statistical support from the University, but they have only ONE GUY who has any idea of SAS. The last time I asked him about using an offset in PROC GLIMMIX he replied 1 month later, and I had a clearer and more interactive communication with Steve Denham in this forum.
My posting history is so sparse because it takes me time to read and understand lots of sources, and the forum is literally my last resort. My background is clearly not Statistics, and as I do not have a person who can support me, the forum is my best bet. I could also quit doing this but I don't think my PI would be thrilled. Thus, I have to continue reading and learning, which is exciting anyway.
Again, sincere thanks for taking your time to recommend me the books and for shedding some light on my code.
Best wishes,
Emma
I'm not convinced that I have a correct understanding of your experimental design. Your message suggests to me that you might be drawing subsamples from each bottle at each timepoint. Let me know whether this is right:
For each of 8 donors, you make 6 bottles of fecal sample. 3 bottles are control (trt=2) and 3 bottles are treatment (trt=1). At 0h (tpt=1) you extract 2 aliquots (or something of that nature) from each bottle and measure various metabolites. At 16h (tpt=2) you extract another 2 aliquots from each bottle and measure various metabolites.
Once we are straight on the design, we can move forward to details about the statistical model.
Meanwhile, I will say that if donor is your replicating factor, then donor must be a random effects factor and donor cannot be in the MODEL or LSMEANS statements. However, it is possible to get predictions that are specific for individual donors using narrow inference; see Ch 6 in SAS System for Mixed Models and
I agree, Steve Denham is a great resource, very generous with his time and expertise.
Hello again Sld,
Your description of the experiment is exactly what I did. Is it correct to think that I can add the following statement to a mixed model?
estimate 'trt' trt -1 1;
I also ran the PROC GLIMMIX model that you suggested and I had the message "Estimated G matrix is not positive definite" and the
following output:
Covariance Parameter Estimates Cov Parm Estimate Standard Error Donor 43.3412 31.0805 Trt*Donor 0 . Bottle(Trt*Donor) 0 . Trt*Tpt*Donor 30.3504 12.6126 Residual 33.7672 4.0088
In addition, I tried the ESTIMATE statement to make directed comparisons, as follows:
proc mixed data=vfas covtest;
class trt tpt donor bottle;
model acetate = trt tpt trt*tpt ;
repeated donor / subject=donor(tpt trt) type=cs ;
lsmeans trt*tpt /pdiff;
estimate "D1-D2 at T16" donor 1 -1 0 0 0 0 0 0 trt*tpt 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0;
estimate "D1-D3 at T16" donor 1 0 -1 0 0 0 0 0 trt*tpt 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0;
estimate "D1-D4 at T16" donor 1 0 0 -1 0 0 0 0 trt*tpt 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0;
estimate "D1-D5 at T16" donor 1 0 0 0 -1 0 0 0 trt*tpt 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0;
estimate "D1-D6 at T16" donor 1 0 0 0 0 -1 0 0 trt*tpt 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0;
estimate "D1-D7 at T16" donor 1 0 0 0 0 0 -1 0 trt*tpt 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0;
estimate "D1-D8 at T16" donor 1 0 0 0 0 0 0 -1 trt*tpt 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0;
estimate "D2-D3 at T16" donor 0 1 -1 0 0 0 0 0 trt*tpt 0 0 0 0 0 0 0 0 1 1 -1 0 0 0 0 0 0 0;
estimate "D2-D4 at T16" donor 0 1 0 -1 0 0 0 0 0 trt*tpt 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0;
estimate "D2-D5 at T16" donor 0 1 0 0 -1 0 0 0 0 trt*tpt 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0;
estimate "D2-D6 at T16" donor 0 1 0 0 0 -1 0 0 trt*tpt 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0;
estimate "D2-D7 at T16" donor 0 1 0 0 0 0 -1 0 trt*tpt 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0;
estimate "D2-D8 at T16" donor 0 1 0 0 0 0 0 -1 trt*tpt 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0;
run;
However, the code does not run with the above and I got the following message:
"Effects used in the ESTIMATE statement must have appeared previously in the MODEL statement". If I remove the ESTIMATE statements, the convergence criteria met, there were no error messages and I get the almost same results in the residuals as in the PROC GLIMMIX suggested, as indicated here:
Covariance Parameter Estimates Cov Parm Subject Estimate Standard Error Z Value Pr Z CS Donor(Trt*Tpt) 67.7318 21.2214 3.19 0.0014 Residual 33.7651 4.0082 8.42 <.0001
Maybe this is an stupid observation anyway.
In any case, I am stuck with the fact that I cannot include "donor" anywhere in the model, but I want to see whether the metabolites on each person are significantly different among people. For instance, whether metabolite1 is different in D2 when compared with D3.
I know that my original intention was to have more people to increase the power of the analysis, but now the discussion came up because I was asked to see which person is different in regards of each metabolite. This is relevant because they may have different bacterial population in the gut, which I measured as well.
I appreciate your generosity and that of the members of the community who have provided me with support in the past. I will be extremely grateful if you can help me to solvee this mess again. I would actually pay for having this kind of support. Again, thanks for your time!
Emma
I won't have time to respond in any detailed way for several days. If you are inclined to wait, you can devote some effort to the concept of narrow inference that I mentioned in a previous response. I think it will allow you to make the donor comparisons you are interested in; if you disagree once you've had a chance to think about it in more detail, let me know.
Other quick thoughts:
Because TRT has only two levels, you do not need to use an ESTIMATE statement. Instead, you can use
lsmeans trt / diff;
At least for the problems I work with, it is not uncommon for variance estimates to be set to zero by the procedure, which produces the "Estimated G matrix is not positive definite" message. See
http://support.sas.com/resources/papers/proceedings12/332-2012.pdf
Usually, I interpret a set-to-zero estimate as being a variance that is very small; many references exist on this topic. In my opinion, there is not one "right" way to deal with this modeling result. SAS offers the NOBOUND option; some people recommend dropping the term from the model. Personally, I do not drop the term from the model if it represents a design element in the experiment. I often use NOBOUND, but not always.
The ESTIMATE statements in your code do not work because DONOR is a random effects factor and cannot be included in ESTIMATE statements in the way that you have specified. Do some reading about "narrow inference".
Regarding
repeated donor / subject=donor(trt tpt) type=cs;
First, repeated donor / ... assumes that donor is a fixed effect factor, which I do not think is the case. Second, donor(trt tpt) assumes that donor is nested within both trt and tpt, which definitely is not true. These are mixed model issues that are addressed in the text SAS System for Mixed Models, so you may find clarification there once you have access to that.
Hello again Sld,
Thank you for your answer. I will continue reading, and I will ask questions once I finish understanding additional literature. I dislike wasting the time of others with nonsense questions.
Best wishes,
Emma
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.