Hey SAS community,
I am working on the analysis for a study that I am currently replicating. A bit of background on this study: it is a randomized block design (RBD) looking at the efficacy of three fungicides on wheat. The response variables are emergence, which is binary (0 or 1), and plot biomass, which is continuous. The explanatory variables are fungicide and inoculum. I have broken this into two data sets, one for biomass and one for emergence, and am analyzing them separately. There are 8 blocks; there is subsampling within the emergence data (n=18 per treatment), and the plot biomass is a summed biomass measurement for each treatment. I have included the datasets, and I will start with my progress on the emergence study.
I would encourage you to check the dataset I have attached. Below I have posted the format I used to structure my code, followed by my actual code, some of my results, and the issues I am trying to address.
Format:
Proc glimmix data=;
class alpha beta block rep;
model y/n = alpha|beta / dist=bin link=logit;
random block block*alpha*beta*rep;
run;
code used:
data wheat_emerg;
input Inoculum $ Fungicide $ Emergence Block Rep;
cards;
non-inoculated untreated 1 1 1
non-inoculated untreated 1 1 2
non-inoculated untreated 1 1 3
non-inoculated untreated 1 1 4
non-inoculated untreated 1 1 5
non-inoculated untreated 1 1 6
non-inoculated untreated 1 1 7
non-inoculated untreated 1 1 8
non-inoculated untreated 1 1 9
non-inoculated untreated 1 1 10
non-inoculated untreated 1 1 11
non-inoculated untreated 1 1 12
non-inoculated untreated 1 1 13
non-inoculated untreated 0 1 14
non-inoculated untreated 0 1 15
non-inoculated untreated 0 1 16
non-inoculated untreated 0 1 17
non-inoculated untreated 0 1 18
Fusarium untreated 1 1 1
Fusarium untreated 1 1 2
Fusarium untreated 1 1 3
.....
run;
proc print data=wheat_emerg;
run;
proc glimmix plots=residualpanel data=wheat_emerg;
class Fungicide Inoculum Block Rep;
model Emergence=Fungicide|Inoculum/ddfm=kr dist=binomial link=logit;
random Block Block*Fungicide*Inoculum*Rep;
lsmeans Fungicide/lines ilink;
lsmeans Inoculum/lines ilink;
lsmeans Fungicide*Inoculum/lines ilink slicediff=(Inoculum Fungicide) plots=(mean(clband connect sliceby=Fungicide));
run;
results:
My concerns:
Wheat biomass
Format used:
Proc glimmix data=;
class alpha beta block;
model y = alpha|beta;
random block;
run;
code:
data wheat_shoot;
input Inoculum $ Fungicide $ Shoot_Weight Block;
cards;
non-inoculated untreated 9.9 1
Fusarium untreated 3.6 1
Fusarium Maxim 9.5 1
Fusarium Saltro 9.8 1
Fusarium Tymirium 9 1
non-inoculated untreated 9.5 2
Fusarium untreated 2.3 2
Fusarium Maxim 10.2 2
Fusarium Saltro 6.5 2
Fusarium Tymirium 8.1 2
non-inoculated untreated 8 3
Fusarium untreated 3.3 3
Fusarium Maxim 8.5 3
Fusarium Saltro 4.7 3
Fusarium Tymirium 5.3 3
……
run;
proc print data=wheat_shoot;
run;
proc glimmix plots=residualpanel data=wheat_shoot;
class Fungicide Inoculum Block;
model Shoot_Weight=Fungicide|Inoculum/ddfm=kr dist=normal;
random Block;
lsmeans Fungicide/lines;
lsmeans Inoculum/lines;
lsmeans Fungicide*Inoculum/lines slicediff=(Inoculum Fungicide) plots=(mean(clband connect sliceby=Fungicide));
run;
concerns:
Bonus question!
As mentioned, I will be replicating this experiment. Any advice on how I would incorporate a replicate of the entire study into the analysis? I’m assuming I just need to add another variable. Is this considered a random term and treated in the same fashion as “Block”? Thanks for any advice here.
Thanks a ton to everyone who took the time to sift through this lengthy post and provide feedback. I will be checking this daily for the coming weeks. Hopefully it’s an easy fix!
Forgot to include the data set, which can be found here.
The non-estimable part arises from the design - you have only fungicide=non-treated with inoculum=non-inoculated. Really quickly, find a copy of Milliken and Johnson's Analysis of Messy Data, vol. 1. In there you will find what they define as a "means model". That is essentially a one-way model, defined by the interaction. Your analysis would go forward using that as the fixed effect, and then pulling out the values and tests of interest using LSMESTIMATE statements. This would be a start:
proc glimmix plots=residualpanel data=wheat_emerg;
class Fungicide Inoculum Block Rep;
model Emergence=Fungicide*Inoculum/ddfm=kr dist=binomial link=logit;
random intercept Fungicide*Inoculum*Rep/subject=block;
lsmeans Fungicide*Inoculum/ ilink ;
/* Insert lsmestimate statements here to combine cell means to get overall fungicide and inoculum means */
/* For differences, you will likely need to use the %Margins macro to correctly calculate the standard errors of the differences,
but for now, you can construct them by taking the difference between the lsmestimates for the four instances */
run;
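To make that placeholder concrete, here is a hedged sketch of what those LSMESTIMATE statements might look like. The coefficient positions are my assumption based on default alphabetical CLASS ordering of the posted data — (Maxim, Saltro, Tymirium, untreated) crossed with (Fusarium, non-inoculated), with the last CLASS variable cycling fastest — and must be verified against the ordering of the LSMeans table in your own output before you trust any of the contrasts:

```sas
proc glimmix plots=residualpanel data=wheat_emerg;
class Fungicide Inoculum Block Rep;
model Emergence=Fungicide*Inoculum/ddfm=kr dist=binomial link=logit;
random intercept Fungicide*Inoculum*Rep/subject=block;
lsmeans Fungicide*Inoculum/ ilink;
/* 8 coefficient slots for the 4x2 cross; the 3 empty
   fungicide x non-inoculated cells simply get zeros */
lsmestimate Fungicide*Inoculum
  'inoculum effect, untreated fungicide'        0 0 0 0 0 0  1 -1,
  'Maxim vs untreated, inoculated'              1 0 0 0 0 0 -1  0,
  'mean of fungicides vs untreated, inoculated' 1 0 1 0 1 0 -3  0
  / divisor=1,1,3 ilink cl;
run;
```

The DIVISOR= list rescales only the third row (an average of three cell means), and ILINK reports the estimates back on the probability scale.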
Your bonus question is an area of a lot of contention. Since you likely only have 2 levels, treating the replication at this level as a random effect may not give you an estimate, or will lead to a message in the output that the G matrix is not positive definite.
If you are exactly duplicating the study, you might consider adding the additional blocks and reps within blocks from the new study to the existing ones, or fitting the replication as a fixed effect and calling it something like 'study_year'.
For this particular area, there is a great resource available in this community, so I am putting out a call for @sld . She may have way more insight into analyzing studies with a design like this.
SteveDenham
As @SteveDenham notes, you have an incomplete factorial for fungicide and inoculum factors, and one way to analyze these data is to make one factor with five levels, as Steve illustrates in his code snippet.
Another way is to partition the data into two sets, one to compare inoculum for fungicide=1, and one to compare fungicides for inoculum="inoculated", and analyze each separately. The primary difference between these two approaches is that the residual variance (and hence, SEs) in the one-way with five levels is based on the full dataset, whereas the variances in the analyses of data partitions are based on only the data used in the model. For a well-behaved dataset (notably one with homogeneous variances), the stories from the results will be similar. All other things equal, variances based on more data are better, so that's an advantage of the one-way with 5 treatments. The disadvantage of the one-way with 5 treatments is that you must identify the actual hypotheses of interest (which are the various hypotheses tested in the two partitions) and code LSMESTIMATE statements to estimate and test them. For this study, that will not be that hard. (See CONTRAST and ESTIMATE Statements Made Easy: The LSMESTIMATE Statement by Kiernan et al.) For more complicated designs, I've sometimes taken the partition approach.
/* Assess effects of 5 different treatments: combinations of fungicide and inoculum */
proc glimmix plots=(studentpanel boxplot) data=wheat;
class Fungicide Inoculum Block;
model Shoot_Weight = Fungicide*Inoculum / ddfm=kr dist=normal;
random Block;
lsmeans Fungicide*inoculum / lines;
/* Add appropriate lsmestimate statements to test hypotheses of interest */
run;
/* Compare untreated inoc to untreated non-inoc to assess effect of inoculum
using a partition of the dataset */
proc glimmix plots=(studentpanel boxplot) data=wheat(where = (fungicide = 1)) nobound;
class Fungicide Inoculum Block;
model Shoot_Weight = Inoculum / ddfm=kr dist=normal;
random Block;
lsmeans Inoculum / lines;
run;
/* Assess effects of fungicide, given inoculum
using a partition of the dataset */
proc glimmix plots=(studentpanel boxplot) data=wheat(where = (inoculum = "inoculated")) nobound;
class Fungicide Inoculum Block;
model Shoot_Weight = Fungicide / ddfm=kr dist=normal;
random Block;
lsmeans fungicide / pdiff adjust=tukey lines;
run;
For the binary emergence data, you need a modification of the design structure of your model to accommodate the 18 subsamples in each plot in each block, something like
/* Assess effects of 5 different treatments: combinations of fungicide and inoculum */
proc glimmix plots=(studentpanel boxplot) data=wheat;
class Fungicide Inoculum Block;
model Emergence = Fungicide*Inoculum / dist=binomial;
random Block
Block*Fungicide*Inoculum;
lsmeans Fungicide*Inoculum / lines ilink;
/* Add appropriate lsmestimate statements to test hypotheses of interest */
run;
Regarding your bonus question: Steve has many excellent insights, and I agree with them all. I don't have an answer to your question, just more questions. Are these field experiments? Is the second experiment in a different location and/or a different year? What do you see as the purpose of the second experiment? If it is intended as a true replication, what is the statistical population to which you make inference? (Replications are, theoretically, a random sample from that population. Is a sample size of two big enough to be adequately representative?) Or do you think of it more as an exercise in repeatability across space and time? Read The many faces of replication by Johnson in Crop Science; it is an excellent resource for thinking about such things.
I hope this helps move you forward.
Thanks for the prompt responses @SteveDenham and @sld.
Would it be incorrect to generalize the analysis to a one-factor rbd despite having an incomplete factorial with two factors, inoculum and fungicide? I did consider this prior to posting and ran the analysis as mentioned above, but also ran it simply as treatment with 5 levels (one factor). Results look largely the same, but I do have some concerns about this. One noted difference is in the results of my LSD, where the estimates are the same but significance is slightly different when running the analysis as two factors versus one factor. Is one method more "correct" than the other?
If I'm understanding this right, by turning this into a one-factor analysis, my significance indicator (p-value) is the interaction term, or close to an interaction term. When running it as a two-factor analysis, I am provided with the main effects and their associated significance, but fail to receive an estimate and significance for the interaction term. Would it be improper to use the one-factor analysis for my interaction and the two-factor analysis for my main effects? All of these p-values are highly significant (p<0.0001), so there isn't much need to interpret the main effects given the significance found within the interaction term. But in theory, would this be incorrect under these circumstances? See the images below.
To answer your other questions about the bonus question: I did read that paper, which I found to be very insightful, but I got the impression that pseudoreplication was always a bad or improper form of replication. In my case I don't find this to be true. I would suggest that pseudoreplication of a binary measurement provides greater precision of the treatment estimate. I'm sure there is some reason I am unaware of for why this shouldn't be included in my analysis, which I would be OK with because there are workarounds. I could simply average the binary response variable of the pseudoreplicates and use this average as the response variable. For example, 10 of my 18 seedlings emerged in one of the treatments, so I'd use 0.55, ditch "rep" as one of my variables, and simplify the analysis whilst providing myself a precise treatment response. Sorry if that is not articulated clearly enough.
In response to some of your other questions: my replication would be metareplication, where I am simply showing repetition and repeatability. These are greenhouse trials, same location and same year, so more of an exercise in repeatability across time than space, perhaps. I understand that ideally this would be done in another growing season and another location, but I have some limitations. Is the suggestion here that I consider this block 9, 10...to 18 and just add this into the analysis? I was under the impression I would need to include this as another variable to show the "contrast," for lack of a better word, between experimental replicate 1 versus 2. Just adding blocks sure would make the analysis easier; I just want to make sure I'm following modern/conventional statistical procedures the best I can.
One other question I had was in regard to my error terms. Is there a helpful source on determining these error terms depending on my design and which factors I choose to run? I know that if I do a one-factor rbd or a two-factor rbd it's just Block. But the binary study as a two-factor with replicates, and the binary study if I just do one factor or average the replicates, is where I have some confusion.
[attached image: main effects and interactions]
[attached image: LSD]
Would it be incorrect to generalize the analysis to a one-factor rbd despite having an incomplete factorial with two factors, inoculum and fungicide?
It would be fine. In fact, it would be an appropriate way to deal with the incomplete factorial if you also included appropriate contrasts to test (1) the effects of fungicide (Edit: including the control when combined with inoculum), and (2) the effect of inoculum for the control. Edit: These contrasts could be a subset of pairwise comparisons among the 5 means.
If I'm understanding this right, by turning this into a one-factor analysis, my significance indicator (p-value) is the interaction term, or close to an interaction term.
Your interpretation is not correct. The p-value for treatment in the one-factor analysis addresses the alternative hypothesis that at least one of the five means differs from at least one other of the five means; it is not a test of interaction. With the 5 treatments in this study, you are not actually able to address "interaction" if we are defining interaction as an effect of inoculum that differs among fungicides (or vice versa: effects of fungicide that differ depending on presence or absence of inoculum). You can compare fungicides given inoculum; you can compare inoculum given control. I don't see that you have a pertinent experimental setup for any sort of statistical interaction of fungicide and inoculum.
Would it be improper to use the one-factor analysis for my interaction and the two-factor analysis for my main effects?
Yes, it would be incorrect.
To answer your other questions about the bonus question: I did read that paper, which I found to be very insightful, but I got the impression that pseudoreplication was always a bad or improper form of replication. In my case I don't find this to be true. I would suggest that pseudoreplication of a binary measurement provides greater precision of the treatment estimate. I'm sure there is some reason I am unaware of for why this shouldn't be included in my analysis, which I would be OK with because there are workarounds. I could simply average the binary response variable of the pseudoreplicates and use this average as the response variable. For example, 10 of my 18 seedlings emerged in one of the treatments, so I'd use 0.55, ditch "rep" as one of my variables, and simplify the analysis whilst providing myself a precise treatment response.
Pseudoreplication is always bad. Subsamples are not bad, it's all in how you incorporate them into the statistical model; pseudoreplication means that you are using subsamples as if they are true replications. For your study, you have two options. (1) You can analyze binary data measured at the subsample-level and include a random effect for plot in addition to block, to accommodate the clustering of 18 measurements within a plot; the plot effect can be specified as Block*Fungicide*Inoculum as in the code I suggested previously. Or (2) you can composite the 18 binary responses for a plot into a binomial response (i.e., number emerged out of 18) which is now a plot-level response and so you no longer include a random effect for plot in the statistical model (although you do now need to consider the distinct possibility of overdispersion). This is like your suggestion of an "average" (which is a proportion), but the binomial distribution is theoretically preferable to analyzing proportion data as a continuous variable (e.g., transforming and then assuming normality).
See
The arcsine is asinine: the analysis of proportions in ecology
Rethinking the Analysis of Non‐Normal Data in Plant and Soil Science
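As a sketch of option (2), the 18 binary subsamples in each plot can be composited into a binomial count and modeled with GLIMMIX's events/trials syntax. The dataset and variable names below are taken from the original post; the composited dataset name (`wheat_emerg_plot`) is my own placeholder:

```sas
/* Composite the 18 subsamples per plot into a binomial (events/trials) response */
proc summary data=wheat_emerg nway;
  class Block Fungicide Inoculum;
  var Emergence;
  output out=wheat_emerg_plot(rename=(_freq_=Total)) sum=Emerged;
run;

proc glimmix data=wheat_emerg_plot;
  class Fungicide Inoculum Block;
  model Emerged/Total = Fungicide*Inoculum / dist=binomial link=logit;
  random Block;
  /* plots are now the observational unit, so no plot-level random effect;
     if overdispersion shows up, consider adding: random _residual_; */
  lsmeans Fungicide*Inoculum / ilink;
run;
```

With the events/trials response the plot itself is the unit of analysis, which is why the Block*Fungicide*Inoculum random effect from the subsample-level model drops out here.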
In response to some of your other questions: my replication would be metareplication, where I am simply showing repetition and repeatability. These are greenhouse trials, same location and same year, so more of an exercise in repeatability across time than space, perhaps. I understand that ideally this would be done in another growing season and another location, but I have some limitations. Is the suggestion here that I consider this block 9, 10...to 18 and just add this into the analysis? I was under the impression I would need to include this as another variable to show the "contrast," for lack of a better word, between experimental replicate 1 versus 2. Just adding blocks sure would make the analysis easier; I just want to make sure I'm following modern/conventional statistical procedures the best I can.
I would not recommend treating the second experiment as additional blocks in the first experiment. That approach is not consistent with the experimental design, and it ignores the possibility that the effects of fungicide or inoculum may differ between experiments. (In randomized block designs, we assume that there is no block by treatment interaction.) With only two experiments, in a greenhouse, the first in early season and the second in later season (and where season may matter and is unreplicated), I probably would analyze each experiment separately and subjectively assess whether results are qualitatively similar.
One other question I had was in regard to my error terms. Is there a helpful source on determining these error terms depending on my design and which factors I choose to run? I know that if I do a one-factor rbd or a two-factor rbd it's just Block. But the binary study as a two-factor with replicates, and the binary study if I just do one factor or average the replicates, is where I have some confusion.
Generalized linear mixed models are complex. Start with the Stroup paper linked above, and then move to his text Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. The text Analysis of Generalized Linear Mixed Models in the Agricultural and Natural Resources Sciences goes into much less detail than the Stroup text and may be more accessible initially.
I hope this helps move you forward.
Thanks @sld for covering all of this. Great answers, although I still think handling the second experiment as additional blocks can be justified. It wouldn't be an RCBD, but rather a strip-block design, I think, where there are two blocking factors treated as one. I would have to dig through SAS for Mixed Models to find code to analyze the design.
So 👍👍👍
SteveDenham
Thanks, @SteveDenham, for the kudos.
I'm not seeing the strip-block idea, so I'd be intrigued to see what you might work up.
Greenhouse experiments can be more variable than one might hope or assume, and some greenhouses are better at controlled conditions than are others. So I would not necessarily assume that environmental conditions were similar between the two experiments, and different environmental conditions might differentially affect emergence and growth.
Experiment as a replication could be incorporated in a statistical model as a random effects factor--essentially a super-block of blocks. But then you are trying to estimate a variance among experiments based on only two observations, and blocks within an experiment play the role of subsamples and not true replications.
Experiment could be incorporated as a fixed effect, "randomly" assigned to blocks. This approach could incorporate tests of experiment by treatment interactions; one could argue that those tests are not valid because blocks are not independent replications of a level of "experiment". And experiment as a fixed effect could be difficult to interpret: experiment is just a catch-all for whatever varies among experiments (e.g., light quality, humidity, temperature, day length).
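A minimal sketch of that fixed-effect formulation, with all of the validity caveats above still applying. The stacked dataset name (`wheat_both`) and the `Experiment` variable are hypothetical, not from the original post:

```sas
proc glimmix plots=residualpanel data=wheat_both;
  class Experiment Fungicide Inoculum Block;
  /* Experiment fixed; its interactions with treatment ask whether
     treatment effects are consistent across the two runs */
  model Shoot_Weight = Experiment|Fungicide*Inoculum / ddfm=kr dist=normal;
  random Block(Experiment);  /* blocks nested within experiment */
  lsmeans Fungicide*Inoculum / lines;
run;
```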
Separate analyses could be done on each experiment. Granted, the comparison of two experiments is subjective, but that seems commensurate with having only two experiments.
This is a nice paper that might indirectly address some of our discussion points:
Casler (2015) Fundamentals of Experimental Design: Guidelines for Designing Successful Experiments
Thanks for bringing attention to this post and providing insight @SteveDenham.
Thank you @sld for your initial post and the continued input to my concerns and questions. I feel comfortable moving forward with this analysis now and learned a great deal.