batecy36
Fluorite | Level 6

Hey SAS community,

 

I am working on the analysis for a study that I am currently replicating. A bit of background on this study: it is a randomized block design (RBD) looking at the efficacy of three fungicides on wheat. The response variables are emergence, which is binary (0 or 1), and plot biomass, which is continuous. The explanatory variables are fungicide and inoculum. I have broken this into two data sets, one for biomass and one for emergence, and I am analyzing them separately. There are 8 blocks; emergence is subsampled (n=18 per treatment), while plot biomass is a single summed measurement per treatment. The datasets are attached, and I will start with my progress on the emergence study.

 

I would encourage you to check the dataset I have attached. Below I have posted the general format I used to structure my code, followed by my actual code, some of my results, and the issues I am trying to address.

 

Format:

Proc glimmix data=;

class alpha beta block rep;
model y/n = alpha|beta / dist=bin link=logit;
random block block*alpha*beta*rep;

run;

 

Code used:

data wheat_emerg;
input Inoculum $	Fungicide $	Emergence	Block Rep;
cards;
non-inoculated	untreated	1	1	1
non-inoculated	untreated	1	1	2
non-inoculated	untreated	1	1	3
non-inoculated	untreated	1	1	4
non-inoculated	untreated	1	1	5
non-inoculated	untreated	1	1	6
non-inoculated	untreated	1	1	7
non-inoculated	untreated	1	1	8
non-inoculated	untreated	1	1	9
non-inoculated	untreated	1	1	10
non-inoculated	untreated	1	1	11
non-inoculated	untreated	1	1	12
non-inoculated	untreated	1	1	13
non-inoculated	untreated	0	1	14
non-inoculated	untreated	0	1	15
non-inoculated	untreated	0	1	16
non-inoculated	untreated	0	1	17
non-inoculated	untreated	0	1	18
Fusarium	untreated	1	1	1
Fusarium	untreated	1	1	2
Fusarium	untreated	1	1	3
.....
run;

proc print data=wheat_emerg;
run;

proc glimmix plots=residualpanel data=wheat_emerg;
class Fungicide Inoculum Block Rep;
model  Emergence=Fungicide|Inoculum/ddfm=kr dist=binomial link=logit;
random  Block Block*Fungicide*Inoculum*Rep; 
lsmeans Fungicide/lines ilink;
lsmeans Inoculum/lines ilink;
lsmeans Fungicide*Inoculum/lines ilink slicediff=Inoculum slicediff=Fungicide plots=(mean(clband connect sliceby=Fungicide));
run;

Results:

[Attached result images: result 1.png, result 3.png, result 4.png, result 5.png, result 6.png]

 

My concerns:

  • I am not confident that I have specified the correct random effects (Block and Block*Inoculum*Fungicide*Rep). This obviously influences my estimates, resulting in wildly different results for my LSD.
  • Why are my least squares means tables non-estimable for the fungicides and for the non-inoculated treatment?
  • Lastly, the residual plots look off, suggesting that I have done something incorrectly here. Or do these plots look as expected for a binary/non-normal analysis?

 

Wheat biomass

 

Format used:

Proc glimmix data=;

class alpha beta block;

model y = alpha|beta;

random block;

run;

 

Code:

data wheat_shoot;
input Inoculum $	Fungicide $	Shoot_Weight	Block;
cards;
non-inoculated	untreated	9.9	1
Fusarium	untreated	3.6	1
Fusarium	Maxim	9.5	1
Fusarium	Saltro	9.8	1
Fusarium	Tymirium	9	1
non-inoculated	untreated	9.5	2
Fusarium	untreated	2.3	2
Fusarium	Maxim	10.2	2
Fusarium	Saltro	6.5	2
Fusarium	Tymirium	8.1	2
non-inoculated	untreated	8	3
Fusarium	untreated	3.3	3
Fusarium	Maxim	8.5	3
Fusarium	Saltro	4.7	3
Fusarium	Tymirium	5.3	3
……
run;

proc print data=wheat_shoot;
run;

proc glimmix plots=residualpanel data=wheat_shoot;
class Fungicide Inoculum Block;
model  Shoot_Weight=Fungicide|Inoculum/ddfm=kr dist=normal;
random  Block;
lsmeans Fungicide/lines;
lsmeans Inoculum/lines;
lsmeans Fungicide*Inoculum/lines slicediff=Inoculum slicediff=Fungicide plots=(mean(clband connect sliceby=Fungicide));
run;

Concerns:

  • As with the previous model, I am getting non-estimable least squares means here as well.

Bonus question!

As mentioned, I will be replicating this experiment. Any advice on how to incorporate a replicate of the entire study into the analysis? I'm assuming I just need to add another variable. Would this be considered a random term and treated in the same fashion as "Block"? Thanks for any advice here.

 

Thanks a ton to everyone who took the time to sift through this lengthy post and provide feedback. I will be checking this daily for the coming weeks. Hopefully it’s an easy fix!

 

10 Replies
batecy36
Fluorite | Level 6

Forgot to include the data set, which can be found here.

SteveDenham
Jade | Level 19

The non-estimable results arise from the design - the only fungicide level that occurs with inoculum=non-inoculated is fungicide=untreated. Really quickly, find a copy of Milliken and Johnson's Analysis of Messy Data, Vol. 1. In there you will find what they define as a "means model". That is essentially a one-way model, defined by the interaction. Your analysis would go forward using that as the fixed effect, and then pulling out the estimates and tests of interest using LSMESTIMATE statements. This would be a start:

 

proc glimmix plots=residualpanel data=wheat_emerg;
class Fungicide Inoculum Block Rep;
model  Emergence=Fungicide*Inoculum/ddfm=kr dist=binomial link=logit;
random  intercept Fungicide*Inoculum*Rep/subject=block;
lsmeans Fungicide*Inoculum/ ilink;
/* Insert lsmestimate statements here to combine cell means to get overall fungicide and inoculum means */
/* For differences, you will likely need to use the %Margins macro to correctly calculate the standard errors of the differences,
   but for now, you can construct them by taking the difference between the lsmestimates for the four instances */
run;
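
To make those placeholder comments concrete, here is a minimal sketch (not part of the original reply) of the kind of LSMESTIMATE statements that could be placed inside the PROC GLIMMIX step above, before the RUN statement. It assumes the class levels sort as Fungicide = Maxim, Saltro, Tymirium, untreated and Inoculum = Fusarium, non-inoculated, so the eight coefficient positions run Maxim*Fusarium, Maxim*non-inoculated, ..., untreated*Fusarium, untreated*non-inoculated; always check the coefficient matrix printed by the E option before trusting the placement.

/* Hypothetical lsmestimate statements; coefficient order is an assumption, verify with the E option */
lsmestimate Fungicide*Inoculum
    'untreated fungicide, averaged over inoculum'   0 0 0 0 0 0 0.5 0.5,
    'Fusarium vs non-inoculated, untreated only'    0 0 0 0 0 0 1  -1,
    'Maxim vs untreated, Fusarium only'             1 0 0 0 0 0 -1  0,
    'Saltro vs untreated, Fusarium only'            0 0 1 0 0 0 -1  0,
    'Tymirium vs untreated, Fusarium only'          0 0 0 0 1 0 -1  0
    / e ilink;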

Your bonus question is an area of a lot of contention. Since you likely have only 2 levels, treating the replication at this level as a random effect may not give you an estimate, or will lead to a message in the output that the G matrix is not positive definite.

 

If you are exactly duplicating the study, you might consider adding the additional blocks and reps within blocks from the new study to the existing ones, or fitting it as a fixed effect and calling it something like 'study_year'.

 

For this particular area, there is a great resource available in this community, so I am putting out a call for @sld .  She may have way more insight into analyzing studies with a design like this.

 

SteveDenham

 

sld
Rhodochrosite | Level 12 (Accepted Solution)

As @SteveDenham notes, you have an incomplete factorial for fungicide and inoculum factors, and one way to analyze these data is to make one factor with five levels, as Steve illustrates in his code snippet.

 

Another way is to partition the data into two sets, one to compare inoculum for fungicide=1, and one to compare fungicides for inoculum="inoculated", and analyze each separately. The primary difference between these two approaches is that the residual variance (and hence, SEs) in the one-way with five levels is based on the full dataset, whereas the variances in the analyses of data partitions are based on only the data used in the model. For a well-behaved dataset (notably one with homogeneous variances), the stories from the results will be similar. All other things equal, variances based on more data are better, so that's an advantage of the one-way with 5 treatments. The disadvantage of the one-way with 5 treatments is that you must identify the actual hypotheses of interest (which are the various hypotheses tested in the two partitions) and code LSMESTIMATE statements to estimate and test them. For this study, that will not be that hard. (See CONTRAST and ESTIMATE Statements Made Easy: The LSMESTIMATE Statement  by Kiernan et al.) For more complicated designs, I've sometimes taken the partition approach.

 

/*  Assess effects of 5 different treatments: combinations of fungicide and inoculum */
proc glimmix plots=(studentpanel boxplot) data=wheat;
    class Fungicide Inoculum Block;
    model  Shoot_Weight = Fungicide*Inoculum / ddfm=kr dist=normal;
    random  Block;
    lsmeans Fungicide*inoculum / lines;
    /*  Add appropriate lsmestimate statements to test hypotheses of interest */
    run;
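
For instance, the hypotheses described above could be coded with LSMESTIMATE statements along these lines, placed before the RUN statement of the step above (a sketch, not part of the original reply; the positional coefficients assume the class levels sort as Fungicide = Maxim, Saltro, Tymirium, untreated and Inoculum = Fusarium, non-inoculated, so verify the placement with the E option):

    /* Hypothetical lsmestimate statements for the one-way (5-cell) model; verify coefficient positions with the E option */
    lsmestimate Fungicide*Inoculum
        'inoculum effect, within untreated fungicide'   0 0 0 0 0 0 1 -1,
        'Maxim vs untreated, given Fusarium'            1 0 0 0 0 0 -1 0,
        'Saltro vs untreated, given Fusarium'           0 0 1 0 0 0 -1 0,
        'Tymirium vs untreated, given Fusarium'         0 0 0 0 1 0 -1 0
        / e adjust=bonferroni;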

/*  Compare untreated inoculated to untreated non-inoculated to assess the effect of inoculum
    using a partition of the dataset */
proc glimmix plots=(studentpanel boxplot) data=wheat(where = (fungicide = 1)) nobound;
    class Fungicide Inoculum Block;
    model  Shoot_Weight = Inoculum / ddfm=kr dist=normal;
    random  Block;
    lsmeans Inoculum / lines;
    run;

/*  Assess effects of fungicide, given inoculum 
    using a partition of the dataset */
proc glimmix plots=(studentpanel boxplot) data=wheat(where = (inoculum = "inoculated")) nobound;
    class Fungicide Inoculum Block;
    model  Shoot_Weight = Fungicide / ddfm=kr dist=normal;
    random  Block;
    lsmeans fungicide / pdiff adjust=tukey lines;
    run;

For the binary emergence data, you need a modification of the design structure of your model to accommodate the 18 subsamples in each plot in each block, something like

 

/*  Assess effects of 5 different treatments: combinations of fungicide and inoculum */
proc glimmix plots=(studentpanel boxplot) data=wheat;
    class Fungicide Inoculum Block;
    model  Emergence = Fungicide*Inoculum / dist=binomial;
    random  Block
            Block*Fungicide*Inoculum;
    lsmeans Fungicide*Inoculum / lines ilink;
    /*  Add appropriate lsmestimate statements to test hypotheses of interest */
    run;

Regarding your bonus question: Steve has many excellent insights, and I agree with them all. I don't have an answer to your question, just more questions. Are these field experiments? Is the second experiment in a different location and/or a different year? What do you see as the purpose of the second experiment? If it is intended as a true replication, what is the statistical population to which you make inference? (Replications are, theoretically, a random sample from that population. Is a sample size of two big enough to be adequately representative?) Or do you think of it more as an exercise in repeatability across space and time? Read The many faces of replication by Johnson in Crop Science; it is an excellent resource for thinking about such things.

 

I hope this helps move you forward.

 

 

 

batecy36
Fluorite | Level 6

Thanks for the prompt responses @SteveDenham and @sld.

 

Would it be incorrect to generalize the analysis to a one-factor RBD despite having an incomplete factorial with two factors, inoculum and fungicide? I did consider this prior to posting and ran the analysis as described above, but I also ran it simply as a treatment with 5 levels (one factor). The results look largely the same, but I do have some concerns about this. One noted difference is in my LSD results, where the estimates are the same but the significance is slightly different when running the analysis as two factors versus one factor. Is one method more "correct" than the other?

 

If I'm understanding this right, by turning this into a one-factor analysis, my significance indicator (p-value) is the interaction term or close to an interaction term. When running it as a two-factor analysis, I am provided with the main effects and their associated significance, but I fail to receive an estimate and significance for the interaction term. Would it be improper to use the one-factor analysis for my interaction and the two-factor analysis for my main effects? All of these p-values are highly significant (p<0.0001), so there isn't much need to interpret the main effects given the significance found within the interaction term. But in theory, would this be incorrect under these circumstances? See the images below.

 

To answer your other questions about the bonus question: I did read that paper, which I found very insightful, but I got the impression that pseudoreplication is always a bad or improper form of replication. In my case I don't find this to be true; I would suggest that pseudoreplication of a binary measurement provides greater precision for the treatment estimate. I'm sure there is some reason I am unaware of for why this shouldn't be included in my analysis, which I would be OK with because there are workarounds. I could simply average the binary response of the pseudoreplicates and use that average as the response variable. For example, 10 of my 18 seedlings emerged in one of the treatments, so I'd use 0.56, ditch "Rep" as one of my variables, and simplify the analysis while still giving myself a precise treatment response. Sorry if that is not articulated clearly enough.

 

In response to some of your other questions: my replication would be metareplication, where I am simply showing repetition and repeatability. These are greenhouse trials, same location and same year, so more of an exercise of repeatability across time than space, perhaps. I understand that ideally this would be done in another growing season and another location, but I have some limitations. Is the suggestion here that I consider this block 9, 10 ... to 18 and just add this into the analysis? I was under the impression I would need to include this as another variable to show the "contrast", for lack of a better word, between experimental replicate 1 versus 2. Just adding blocks sure would make the analysis easier; I just want to make sure I'm following modern/conventional statistical procedures as best I can.

 

One other question I had was in regard to my error terms. Is there a helpful source on determining these error terms depending on my design and which factors I choose to include? I know that if I do a one-factor RBD or a two-factor RBD it's just Block. But I have some confusion about the binary study as a two-factor design with replicates, and about the binary study if I just do one factor or average the replicates.

 

batecy36
Fluorite | Level 6

[Attached images: main effects and interactions; two-factor weight; one-factor weight; two-factor emergence; one-factor emergence]

batecy36
Fluorite | Level 6

LSD

 

[Attached images: one-factor emergence; two-factor weight; two-factor emergence; one-factor weight]

sld
Rhodochrosite | Level 12

Would it be incorrect to generalize the analysis to a one-factor RBD despite having an incomplete factorial with two factors, inoculum and fungicide?


It would be fine. In fact, it would be an appropriate way to deal with the incomplete factorial if you also included appropriate contrasts to test (1) the effects of fungicide (Edit: including control when combined with inoculum), and (2) the effect of inoculum for the control. Edit: These contrasts could be a subset of pairwise comparisons among the 5 means.

 


If I'm understanding this right, by turning this into a one-factor analysis, my significance indicator (p-value) is the interaction term or close to an interaction term.

Your interpretation is not correct. The p-value for treatment in the one-factor analysis addresses the alternative hypothesis that at least one of the five means is different than at least one other of the five means; it is not a test of interaction. With the 5 treatments in this study, you are not actually able to address "interaction" if we are defining interaction as an effect of inoculum that differs among fungicides (or vice versa: effects of fungicide that differ depending on presence or absence of inoculum). You can compare fungicides given inoculum; you can compare inoculum given control. I don't see that you have a pertinent experimental setup for any sort of statistical interaction of fungicide and inoculum.

 


Would it be improper to use the one-factor analysis for my interaction and the two-factor analysis for my main effects?

Yes, it would be incorrect.

 


To answer your other questions about the bonus question: I did read that paper, which I found very insightful, but I got the impression that pseudoreplication is always a bad or improper form of replication. In my case I don't find this to be true; I would suggest that pseudoreplication of a binary measurement provides greater precision for the treatment estimate. I'm sure there is some reason I am unaware of for why this shouldn't be included in my analysis, which I would be OK with because there are workarounds. I could simply average the binary response of the pseudoreplicates and use that average as the response variable. For example, 10 of my 18 seedlings emerged in one of the treatments, so I'd use 0.56, ditch "Rep" as one of my variables, and simplify the analysis while still giving myself a precise treatment response.

Pseudoreplication is always bad. Subsamples are not bad, it's all in how you incorporate them into the statistical model; pseudoreplication means that you are using subsamples as if they are true replications. For your study, you have two options. (1) You can analyze binary data measured at the subsample-level and include a random effect for plot in addition to block, to accommodate the clustering of 18 measurements within a plot; the plot effect can be specified as Block*Fungicide*Inoculum as in the code I suggested previously. Or (2) you can composite the 18 binary responses for a plot into a binomial response (i.e., number emerged out of 18) which is now a plot-level response and so you no longer include a random effect for plot in the statistical model (although you do now need to consider the distinct possibility of overdispersion). This is like your suggestion of an "average" (which is a proportion), but the binomial distribution is theoretically preferable to analyzing proportion data as a continuous variable (e.g., transforming and then assuming normality).

 

See

The arcsine is asinine: the analysis of proportions in ecology 

Rethinking the Analysis of Non‐Normal Data in Plant and Soil Science 
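
As a concrete illustration of option (2) above, here is a minimal sketch (the dataset and variable names wheat_emerg_plot, Emerged, and Trials are assumptions, not from the original posts) that collapses the 18 binary subsamples to a plot-level count and then fits an events/trials binomial model; the commented RANDOM _RESIDUAL_ line is one common way to let the model absorb overdispersion.

/* Collapse subsample-level binary data to one row per plot (block x fungicide x inoculum) */
proc sql;
    create table wheat_emerg_plot as
    select Inoculum, Fungicide, Block,
           sum(Emergence) as Emerged,   /* number of seedlings that emerged */
           count(*)       as Trials     /* number of seeds sown (18 here) */
    from wheat_emerg
    group by Inoculum, Fungicide, Block;
quit;

/* Plot-level events/trials model: plot is now the observational unit, so no extra random plot effect */
proc glimmix data=wheat_emerg_plot;
    class Fungicide Inoculum Block;
    model Emerged/Trials = Fungicide*Inoculum / dist=binomial link=logit;
    random Block;
    /* random _residual_;  adds a multiplicative scale parameter to check/absorb overdispersion */
    lsmeans Fungicide*Inoculum / ilink;
run;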

 


In response to some of your other questions: my replication would be metareplication, where I am simply showing repetition and repeatability. These are greenhouse trials, same location and same year, so more of an exercise of repeatability across time than space, perhaps. I understand that ideally this would be done in another growing season and another location, but I have some limitations. Is the suggestion here that I consider this block 9, 10 ... to 18 and just add this into the analysis? I was under the impression I would need to include this as another variable to show the "contrast", for lack of a better word, between experimental replicate 1 versus 2. Just adding blocks sure would make the analysis easier; I just want to make sure I'm following modern/conventional statistical procedures as best I can.

I would not recommend treating the second experiment as additional blocks in the first experiment. That approach is not consistent with the experimental design, and it ignores the possibility that the effects of fungicide or inoculum may differ between experiments. (In randomized block designs, we assume that there is no block by treatment interaction.) With only two experiments, in a greenhouse, the first in early season and the second in later season (and where season may matter and is unreplicated), I probably would analyze each experiment separately and subjectively assess whether results are qualitatively similar.

 


One other question I had was in regard to my error terms. Is there a helpful source on determining these error terms depending on my design and which factors I choose to include? I know that if I do a one-factor RBD or a two-factor RBD it's just Block. But I have some confusion about the binary study as a two-factor design with replicates, and about the binary study if I just do one factor or average the replicates.

Generalized linear mixed models are complex. Start with the Stroup paper linked above (Rethinking the Analysis of Non-Normal Data in Plant and Soil Science), and then move to his text Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. The text Analysis of Generalized Linear Mixed Models in the Agricultural and Natural Resources Sciences goes into much less detail than the Stroup text and may be more accessible initially.

 

I hope this helps move you forward.

 

SteveDenham
Jade | Level 19

Thanks @sld  for covering all of this.  Great answers, although I still think handling the second experiment as additional blocks can be justified.  It wouldn't be an RCBD, but rather a strip-block design, I think, where there are two blocking factors treated as one. I would have to dig through SAS for Mixed Models to find code to analyze the design.

 

So 👍👍👍

 

SteveDenham

 

 

sld
Rhodochrosite | Level 12

Thanks, @SteveDenham, for the kudos.

 

I'm not seeing the strip-block idea, so I'd be intrigued to see what you might work up.

 

Greenhouse experiments can be more variable than one might hope or assume, and some greenhouses are better at controlled conditions than are others. So I would not necessarily assume that environmental conditions were similar between the two experiments, and different environmental conditions might differentially affect emergence and growth.

 

Experiment as a replication could be incorporated in a statistical model as a random effects factor--essentially a super-block of blocks. But then you are trying to estimate a variance among experiments based on only two observations, and blocks within an experiment play the role of subsamples and not true replications.

 

Experiment could be incorporated as a fixed effect, "randomly" assigned to blocks. This approach could incorporate tests of experiment by treatment interactions; one could argue that those tests are not valid because blocks are not independent replications of a level of "experiment". And experiment as a fixed effect could be difficult to interpret: experiment is just a catch-all for whatever varies among experiments (e.g., light quality, humidity, temperature, day length).
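
A minimal sketch of that fixed-effect option (the combined dataset wheat_both and the Experiment variable are assumed names, not from the original posts); the commented lines show the random "super-block" alternative from the preceding paragraph:

/* Hypothetical combined analysis of the two experiments */
proc glimmix data=wheat_both;
    class Experiment Fungicide Inoculum Block;
    /* Experiment as a fixed effect, crossed with the 5 treatment cells */
    model Shoot_Weight = Experiment|Fungicide*Inoculum / ddfm=kr dist=normal;
    random Block(Experiment);            /* blocks are nested within experiment */
    /* If Experiment were instead treated as random, drop it from the MODEL statement and use:
       random Experiment Block(Experiment);
       but the experiment variance would then be estimated from only two levels */
    lsmeans Fungicide*Inoculum / lines;
run;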

 

Separate analyses could be done on each experiment. Granted, the comparison of two experiments is subjective, but that seems commensurate with having only two experiments.

 

This is a nice paper that might indirectly address some of our discussion points:

Casler (2015) Fundamentals of Experimental Design: Guidelines for Designing Successful Experiments

 

batecy36
Fluorite | Level 6

Thanks for bringing attention to this post and providing insight @SteveDenham.

Thank you @sld for your initial post and the continued input to my concerns and questions. I feel comfortable moving forward with this analysis now and learned a great deal.
