turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Baseline Covariate + Repeated Measure Analysis for...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-29-2017 11:44 AM

Hi All,

I am a bit puzzled on how to carry out the analysis I am hoping to. A bit about my study and data design, as well as what I have been able to successfully accomplish so far and where I think I am getting stuck:

This data is from an agronomy field study - RCBD w/ split-plots and repeated mesures. I have 4 blocks ("Rep"), with 2 Irrigation treatments ("Irr_Trt") as the whole plot factors and 6 Nitrogen treatments ("N_Trt") as the split-plot factors. The response variable I am interested in measuring is nitrate concentration ("Concentration_NO3") in the soil water solution which is measured on 25 dates ("Sample_Date") over the course of the experiement (which is a single growing season); the response variable also has a lognormal distribution (beacuse concentrations cannot be less than zero...). I have a model that I feel very comfortable with, and believe that it is appropriate for the analysis of my data. The covariance structure I chose to use is the R-side CS + AR(1) type; I am by no means an GLMM expert, but after combing through jusst about everything Walter Stroup (and also among many other experts) has written, this seems to be the best fit for my data.

proc glimmix

nobound

data=work.B16

class Irr_Trt N_Trt Sample_Date Rep;

model Concentration_NO3 = Irr_Trt|N_Trt|Sample_Date/

dist=lognormal

ddfm=kenwardroger2;

random intercept Irr_Trt N_Trt*Irr_Trt/subject=Rep;

random Sample_Date/subject = N_Trt*Irr_Trt*Rep type=ar(1) residual;

run;

quit;

Now for the part where things get a bit more challenging for me... The CV for this data is quite large, ~50%, and I have been digging deeper into the data set itself for insights. One thing that escaped my initital review of the data, is that the first 4 dates when samples were collected occured prior to the imposition of either the irrigation or nitrogen treatments. In this sense, I have a initial value covariate ("NO3cov"), measured for each experimental unit (i.e. the split-plot or N_Trt*Irr_Trt*Rep unit). I believe that incorporation of this covariate (as a correction for initial differences in nitrate concentration) will resolve some of the unexplained variance in the data.

My plan is to re-run the analysis, by restricting my model to the data collected after 1 June 2016 (which was the date when treatments were initially imposed), and using the mean values of nitrate concentration for each experimental unit colelcted prior to that date as baseline covariates. This leaves 21 dates for inclusion in the repeated measures analysis, with the first 4 sample dates used as covariates. To start, I did some simple linear regressions of the measured nitrate concentration against the covariate for each sampling date and for the mean concentration over all sampling dates. There was a positive and significant relationship between the covariate and measured nitrate concentration for the mean values over the whole season; looking at each sampling date, there was a positive and significant relationship only for the first 6 sampling dates after the imposition of the treatments while the final 15 sampling dates did not have a signifiacnt relationship with the covariate.

Using this insight (and an example from SAS for Mixed Models (Littell et al. p. 186), here is the model I am now attempting to run (with changes from previous model in **bold**):

proc glimmix

nobound

data=work.B16

class Irr_Trt N_Trt Sample_Date Rep;

model Concentration_NO3 = Irr_Trt|N_Trt|Sample_Date **NO3cov NO3cov*Sample_Date**/

dist=lognormal

ddfm=kenwardroger2;

random intercept Irr_Trt N_Trt*Irr_Trt/subject=Rep;

random Sample_Date/subject = N_Trt*Irr_Trt*Rep type=ar(1) residual;

run;

quit;

In this analysis, I found that the NO3cov main effect was not significant (p(>F) = 0.27) but that the NO3cov*Sample_Date interaction effect was significant (p(>F) = 0.02). The AICC, however, increased from 1224 to 1300 with the inclusion of the covariate.

To date, I haven't seen a discussion on how to include a baseline covariate within a repeated measures and split-plot design. Having somewhat limited experience, I wanted to gather some feedback on what insights others might have on the analysis I am conducting. In particular, my outstanding questions and concerns are:

1) Is the baseline covariate parameterized correctly in the model statement? Do I need both the NO3 cov and NO3cov*Sample_Date to account for the fact that the relationship of the response variable to the covariate changes over time?

2) Do I need to change or include an additional random statement when I add the covariate? will the df be correct for the covariate effect with the specification that I have?

3) How does the inculsion of the covariate affect the existing repeated measures covariance structure?

4) Is this even an appropriate use of a covariate?

5) Is it a "problem" that the main effect of NO3cov is not significant while the interaction effect of NO3cov*Sample_Date is? What about the increase in AICC value when the covariate was inculded?

Thanks!

Brian

Accepted Solutions

Solution

06-06-2017
08:30 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-01-2017 12:57 AM

Kudos for a well-crafted question. I wish I had the answers to all of your questions.

1) Is the baseline covariate parameterized correctly in the model statement? Do I need both the NO3 cov and NO3cov*Sample_Date to account for the fact that the relationship of the response variable to the covariate changes over time?

I can see the value in using (the mean of the first four dates of) NO3cov as a covariate. I think it would make sense to use log(NO3cov) if you are using a lognormal distribution for Concentration_NO3. The model assumes a linear relationship between the link-scale of the response and the covariate.

NO3cov*Sample_Date allows the slope of the linear regression of Concentration_NO3 (on the link scale) to vary by Sample_Date. You should retain NO3cov if the interaction is in the model statement.

2) Do I need to change or include an additional random statement when I add the covariate? will the df be correct for the covariate effect with the specification that I have?

Adding a covariate into a mixed model is an appreciable complication, depending on the hierarchical level at which the covariate is measures (here, Rep*Irr_Trt*N_Trt, aka subplot); check out random coefficient models in the Littell et al. text (SAS for Mixed Models, 2nd ed), Stroup (Generalized Linear Mixed Models), and Milliken and Johnson (Analysis of Messy Data, Vol III Analysis of Covariance).

I always go into a model with some idea of what I think are appropriate denom df, just to check.

3) How does the inculsion of the covariate affect the existing repeated measures covariance structure?

Not sure about that.

4) Is this even an appropriate use of a covariate?

Google "change in baseline ancova".

5) Is it a "problem" that the main effect of NO3cov is not significant while the interaction effect of NO3cov*Sample_Date is? What about the increase in AICC value when the covariate was inculded?

Nope. Google "ancova centering". This text

https://books.google.com/books/about/Multiple_Regression.html?id=LcWLUyXcmnkC

has a nice introductory level discussion of centering.

Regarding the AICC increasing with the inclusion of the covariate: To compare models that differ in fixed effects factor, you need to use a true maximum likelihood method (LAPLACE or QUAD), rather than REML.

There are many possible different bells and whistles to be considered. Is the relationship linear? Does the slope vary with Irr_Trt and/or N_Trt? Is the slope constant for all whole plots, or does it vary?

Stepping back further, I would ask: What is your research question about sampling dates? 21 or 25 levels is a lot to sort out in an ANOVA context. Why did you make all these observations? Are you actually interested in comparing means among all 21 or 25? Would it make sense to extract some agronomically meaningful statistic from all these measurements to use as a response, or to regress on sampling date? Is there a statistician at your institution that you could work with, or a statistics prof, or a statistical consulting center? (I know, there may not be anyone, or at least anyone that knows more than you.) Lots of things to ponder, and a good stat colleague would be a more useful thing than this forum.

All Replies

Solution

06-06-2017
08:30 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-01-2017 12:57 AM

Kudos for a well-crafted question. I wish I had the answers to all of your questions.

1) Is the baseline covariate parameterized correctly in the model statement? Do I need both the NO3 cov and NO3cov*Sample_Date to account for the fact that the relationship of the response variable to the covariate changes over time?

I can see the value in using (the mean of the first four dates of) NO3cov as a covariate. I think it would make sense to use log(NO3cov) if you are using a lognormal distribution for Concentration_NO3. The model assumes a linear relationship between the link-scale of the response and the covariate.

NO3cov*Sample_Date allows the slope of the linear regression of Concentration_NO3 (on the link scale) to vary by Sample_Date. You should retain NO3cov if the interaction is in the model statement.

2) Do I need to change or include an additional random statement when I add the covariate? will the df be correct for the covariate effect with the specification that I have?

Adding a covariate into a mixed model is an appreciable complication, depending on the hierarchical level at which the covariate is measures (here, Rep*Irr_Trt*N_Trt, aka subplot); check out random coefficient models in the Littell et al. text (SAS for Mixed Models, 2nd ed), Stroup (Generalized Linear Mixed Models), and Milliken and Johnson (Analysis of Messy Data, Vol III Analysis of Covariance).

I always go into a model with some idea of what I think are appropriate denom df, just to check.

3) How does the inculsion of the covariate affect the existing repeated measures covariance structure?

Not sure about that.

4) Is this even an appropriate use of a covariate?

Google "change in baseline ancova".

5) Is it a "problem" that the main effect of NO3cov is not significant while the interaction effect of NO3cov*Sample_Date is? What about the increase in AICC value when the covariate was inculded?

Nope. Google "ancova centering". This text

https://books.google.com/books/about/Multiple_Regression.html?id=LcWLUyXcmnkC

has a nice introductory level discussion of centering.

Regarding the AICC increasing with the inclusion of the covariate: To compare models that differ in fixed effects factor, you need to use a true maximum likelihood method (LAPLACE or QUAD), rather than REML.

There are many possible different bells and whistles to be considered. Is the relationship linear? Does the slope vary with Irr_Trt and/or N_Trt? Is the slope constant for all whole plots, or does it vary?

Stepping back further, I would ask: What is your research question about sampling dates? 21 or 25 levels is a lot to sort out in an ANOVA context. Why did you make all these observations? Are you actually interested in comparing means among all 21 or 25? Would it make sense to extract some agronomically meaningful statistic from all these measurements to use as a response, or to regress on sampling date? Is there a statistician at your institution that you could work with, or a statistics prof, or a statistical consulting center? (I know, there may not be anyone, or at least anyone that knows more than you.) Lots of things to ponder, and a good stat colleague would be a more useful thing than this forum.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-06-2017 09:35 AM

Thank you for your response! I at least feel like I am on the right track, even if some of the finer details still need to be worked out. I wanted to add a couple of thoughts in response, if not for anything to help others who have stumbled across this post in search of getting their own related questions answered - browsing through forum discussion is where I have learned a great deal of what I know about GLMMs...

After I changed the covariate from NO3cov to log(NO3cov), the improvement in the model was appreciable. The standard errors for my estimated lsmeans decreased, and the AICC decreased to a value about equal to the baseline model without the covariate. I understand that using a metric like AICC is tricky in situations when pseudo-likelihood is involved and that the Laplace or Quad methods would be most appropriate. When I have enough computing downtime, I'll run one of these models because at this point the runtime is just too long for this model + data set... I also wanted to add that I moved away from the AR(1) + CS to an ANTE(1) R-side covariance structure. After doing some additional exploratory and diagnostic work, I informally found this structure to be the best fit.

Addressing some of the broader questions you pointed out at the end of the post, here are my thoughts on why I chose the approach that I did. Historically, this type of data (Concentration_NO3) has been used as an intermediate value in computing a response variable of agronomic interest (but which we do not directly measure) - NO3 leaching load. Concentrations are in units of mg/L while loads are in units of kg/ha. In short, NO3 load is calculated by multiplying a daily estimated NO3 concentration by a daily estimated flux of water downward through the soil (this is a whole other problem of estimate, fraught with its own issues...). We measure NO3 concentration on only a handful of sampling dates, due to logistical constraints. In previous analyses, the raw data for each experimental unit was interpolated across the growing season to obtain daily estimates of concentration for each experimental unit. In this manner, an estimate of NO3 leaching load was obtained for each experimental unit by multiplying the interpolated NO3 concentration by the daily estimated percolation below the root zone (i.e. downward water flux), and more standard split-plot mixed model analysis was appropriate.

However, serious red flags were raised for this method which questioned the reliability of this method of analysis. We found in some cases that our numerical estimates of NO3 load for the control N_Trt was higher than the high rate N_Trt, which was very likely an artifact of either a failed experiment or failed analysis. In other cases, we found no significant differences between any of the N_Trt or Irr_Trt which suggests that the power of this method was too low to be suitable.

We decided to try a new approach to this analysis, focusing on the measured NO3_Concentration as the response variable of primary importance rather than the calculated NO3 load. The thought was simply that we should make our statistical inferences before proceeding with calculations (more on this in a bit). And then an interpolation-type approach could proceed using the estimated means for Irr_Trt*N_Trt. Using a non-linear regression approach is possible, but the shape of the response curve is problematic. Within a given growing season, the response curve has a similar form for all treatments; however, the shape of the response curve is not similar across growing seasons. I have 5 years of similar data to analyze, and each has a distinct and non-linear curve for NO3_Concentration as a function of time. An analysis with time as a categorical, rather than continuous, variable seemed appropriate in this respect.

The goal is then to compare the treatment means on each Sample_Date, or at least to compare a given set of non-orthogonal contrast statements on each Sample_Date. I also sliced my analysis of the lsmeans on the Sample_Date effect to evaluate the N_Trt, Irr_Trt, and their interaction on each date. I didn't include this in my original post, but here is a partial snippet of code for reference:

lsmeans Irr_Trt N_Trt Irr_Trt*N_Trt Irr_Trt*Sample_Date N_Trt*Sample_Date Irr_Trt*N_Trt*Sample_Date/CL alpha=0.1 slice=Sample_Date;

contrast 'N1' N_Trt -5 1 1 1 1 1;

contrast 'N2' N_Trt 0 -1 -1 1 1 0;

contrast 'N3' N_Trt 0 -1 1 -1 1 0;

contrast 'N4' N_Trt 0 0 0 -1 -1 2;

contrast 'I*N1' Irr_Trt*N_Trt 5 -1 -1 -1 -1 -1 -5 1 1 1 1 1;

contrast 'I*N2' Irr_Trt*N_Trt 0 1 1 -1 -1 0 0 -1 -1 1 1 0;

contrast 'I*N3' Irr_Trt*N_Trt 0 1 -1 1 -1 0 0 -1 1 -1 1 0;

contrast 'I*N4' Irr_Trt*N_Trt 0 0 0 1 1 -2 0 0 0 -1 -1 2;

contrast 'N1 @ D1' N_Trt -5 1 1 1 1 1

N_Trt*Sample_Date [-5,1 1][1,2 1][1,3 1][1,4 1][1,5 1][1,6 1];

contrast 'N2 @ D1' N_Trt 0 -1 -1 1 1 0

N_Trt*Sample_Date [0,1 1][-1,2 1][-1,3 1][1,4 1][1,5 1][0,6 1];

contrast 'N3 @ D1' N_Trt 0 -1 1 -1 1 0

N_Trt*Sample_Date [0,1 1][-1,2 1][1,3 1][-1,4 1][1,5 1][0,6 1];

contrast 'N4 @ D1' N_Trt 0 0 0 -1 -1 2

N_Trt*Sample_Date [0,1 1][0,2 1][0,3 1][-1,4 1][-1,5 1][2,6 1];

The contrasts I chose reflect the a priori hypotheses that the experiment was designed with, relating to different nitrogen rate, source, and timing treatments. I am not interested, necessarily, in a full blown multiple comparisons of my treatment means on a given date, but rather if any of the specified contrasts are significant. Becuase of the high CV on this data, I wanted to be especially conservative when interpreting my results and before using the concentration data to make estimates of N-leaching load and to prevent a situation in which calculated N-leaching loads have numerical differences without any significant statistical differences. The mean comparisons of the contrast statements on each Sample_Date is effectively a protected version of the original computational procedure.

The next steps are to use the results from this analysis to compute NO3 load, which is the real response variable of interest. Although I will not be able to conduct a formal statistical analysis of the computed data, I should end up with a set of values for NO3 load where numerical differences represent true and significant differences between treatments. In the end, an analysis of the main effect of Sample_Date is just an intermediate procedure that matters for computational purposes, but by itself is not all that interesting.

If anyone else has thoughts on this approach or ways in which I could improve it, feel free to contribute!