skr01
Calcite | Level 5

Hello!

The core of my question has to do with how to correctly code repeated measures when the repeated effect (here "sampling") isn't the same for all of the subjects. In context:

I have sampled 15 houses 3-8 times each for presence or absence of a bacterium. Within each "house" there are "real_locs" (specific physical locations), and each physical location that was sampled has a unique identifier. Real_loc is the subject upon which repeated measures were taken. Each real_loc falls into an "environment" category. Samplings happened in different "seasons". I believe "sampling" is the "repeated effects" predictor variable. Here's what confuses me: each house was sampled up to 8 times, approximately 3 months apart, but sampling did not begin at the same time for all houses; start dates are scattered across 2 years. So Sampling 1 does not correspond to the same time for different houses, but all the real_locs in a given house were sampled at the same time for that house's Sampling 1 (and so on for Sampling 2, etc.).

My questions that I want to incorporate in my model are:

a&b) Are there differences among types of environments and seasons in the probability of recovering our bacterium? (should be fixed effects)

c&d) Is there significant variation among houses in rates of recovery?, and Does recovery among environments vary across houses? (I think G-side random factors)

e) Is the probability of recovering the bacterium at a later sampling correlated with whether it was recovered there in the past? (in other words, not only do I know I need to accommodate the repeated measures in the model, I am actually interested in whether there is a "significant effect of repeated sampling in a location").

My code is below. (It is based on my familiarity with PROC MIXED, the User's Guide to GLIMMIX, especially the pages on repeated measures, and a helpful exchange with SAS Tech Support which, thank goodness, means that each run trying things out on our real data, a data set that is both large and very imbalanced, now takes 10-30 minutes instead of hours.)

proc glimmix data=mydata;
   class house sampling environment real_loc season;
   model Recovery = environment season sampling / dist=binary link=logit ddfm=residual;
   random int environment / subject=house;
   random sampling / subject=real_loc type=AR(1) residual;
   covtest / wald;
   nloptions tech=nrridg maxiter=250;
run;

I remain worried that I have something wrong with the repeated measures. I would have thought that sampling was a random factor rather than a fixed one, and/or that it would be nested within house, but all the sources I have consulted seem to suggest I am thinking about it wrong. So my questions are:

1) Does this code correspond to my questions (In particular, my question regarding repeated measures)?

2) If so, what is the significant F-test of the fixed effect "sampling" telling me?

3) Is it correct to interpret the significant CovParm "AR(1)" with subject "real_loc" as telling me that there is a significantly greater likelihood of finding the bacterium again if you found it somewhere once?

Thank you for your help and your time,

Susi

Message was edited by: Susanna Remold. I should say, I realized there is an issue with using R-side random effects coding with logistic regression in GLIMMIX, as discussed in Steve Denham's post on R-side vs G-side from last April and in the discussion of repeated measures in trap data initiated by hornet1937. But I don't even know where to begin to modify the model to deal with that, so I am leaving it as a separate issue for the time being.

SteveDenham
Jade | Level 19

I'll take a try at the last 3 questions there.

1. I think this goes after your questions, provided sampling is 'aligned' within seasons. If you could define sampling as something like sample_month_year, you may have more than 8 levels, but the effect would be more reasonably modeled. For instance, suppose house1 was sampled in 2011 in January, March and June, while house2 was sampled in June, September and November. This would give rise to five levels of sample. I just don't see an easier way to get at this across sampling units: sample1 for house1 and sample1 for house2 don't align, while sample3 for house1 and sample1 for house2 do. If, however, sampling in your model relates to time following some sort of intervention (which may be as simple as the fact that you started taking samples), then it is probably OK.
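For instance, a calendar-aligned sampling variable could be created in a data step, roughly like this (a sketch only; sample_date is a hypothetical variable standing in for however the date of each sampling is recorded):

data mydata2;
   set mydata;
   /* Align samplings on the calendar rather than on sampling number:
      a sample taken 15JAN2011 becomes level 2011_01, so samplings in
      the same month get the same level regardless of house */
   sample_month_year = catx('_', year(sample_date), put(month(sample_date), z2.));
run;

That way sample3 for house1 and sample1 for house2 land on the same level whenever they fall in the same month.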

2. The significant F test for sampling says that the means of the various sampling times are not equal--at least one differed from the others.

3. I kind of understand that interpretation. I would say that there is a significant correlation in the incidence rate between timepoints, so your interpretation is quite a fair one.

Now on to the G-side stuff. Remove the residual option from your second random statement, and add method=laplace (or method=quad, if you have lots of data) to the proc glimmix statement. It is that easy.
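In code, that change would look something like this (an untested sketch, keeping everything else from the original call):

proc glimmix data=mydata method=laplace; /* or method=quad with lots of data */
   class house sampling environment real_loc season;
   model Recovery = environment season sampling / dist=binary link=logit ddfm=residual;
   random int environment / subject=house;
   /* residual option removed, so this random effect is now G-side */
   random sampling / subject=real_loc type=AR(1);
   covtest / wald;
   nloptions tech=nrridg maxiter=250;
run;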

Steve Denham

skr01
Calcite | Level 5

Dear Steve,

your response was a huge help. I'm not there yet but I think I finally have a grasp of what the final model might look like!!

Working backwards -- I'm glad I am interpreting the AR(1) covariance parameter Wald test appropriately, and glad to put that issue aside for now.

Your answer to my second question, interpreting an overall "sampling" effect as meaning that there are differences among samplings overall, confirms that I have been working with an incorrectly coded model, one that is nonsensical in the framework of the actual experimental design.

Your suggestions about recoding "sampling" are eye-opening, and I think I am finally getting closer to understanding (hooray, and thank you). Samplings occurred every 3 months (once per season per year), began when we enrolled subjects, and ended when we reached 8 samplings, so clearly coding "sampling" as a fixed effect as I have been doing is incorrect. Based on your suggestion I created a new variable, "year_season". Each house was sampled in up to 8 year_seasons, and the data set has 19 year_seasons, 15 houses, 2305 real_locs, and a total of 11676 observations. I ran a number of models including just house, season, and year_season (i.e., excluding environment for now, for simplicity and to try to make things run faster/run at all).

What I found is that if I try to use the Laplace method, I get the error "Integer overflow on computing amount of memory required", and SAS stops processing due to insufficient memory.

If I try to use the quadrature method, I get the error "Estimation by quadrature is available only if the data can be processed by subjects. Make sure that all G-side RANDOM statements have SUBJECT=effect. If there are multiple SUBJECT= effects they need to form a containment hierarchy, e.g., SUBJECT=A, SUBJECT=A*B, SUBJECT=A(B), ..."

My current model has two lines:

random int / subject=house;

random year_season / subject=realloc type=AR(1);

My eventual model, once I put environment back in, will include:

random int environment/ subject=house;

random year_season / subject=realloc type=AR(1);

I finally tried the R-side approach, and SAS ran for 6 hours before stopping with the warning: "Obtaining minimum variance quadratic unbiased estimates as starting values for the covariance parameters failed."

So at this point, I am excited about having a model structure that makes intuitive sense given the way the data were collected and the questions I want to ask, but I still need help getting it implemented.

Your feedback is very much appreciated!

Susi

SteveDenham
Jade | Level 19

Kind of jumping around, but you can address the quadrature error by changing the subject of the second random statement, so that it reads:

random year_season / subject=realloc*house type=AR(1);

This sets up the hierarchical containment.  Of course, it doesn't guarantee that it will run, but at least it should get rid of the error.  Oh yeah, sorting.  Make sure the data are sorted by house and realloc, as in:

proc sort data=in out=sortedout;
   by house realloc year_season;
run;
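Putting the two together, the whole call might then look something like this (just a sketch; as said above, no guarantee it runs on data this size, and the fixed effects here are the reduced set you described):

proc glimmix data=sortedout method=quad; /* method=laplace is the other option */
   class house realloc season year_season;
   model Recovery = season year_season / dist=binary link=logit;
   random int / subject=house;
   /* subject=realloc*house is contained in subject=house, satisfying the hierarchy */
   random year_season / subject=realloc*house type=AR(1);
run;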

Steve Denham

skr01
Calcite | Level 5

Dear Steve-

Well, replacing subject=realloc with subject=realloc*house does solve the hierarchical containment issue, but SAS stops due to insufficient memory. I have been reading to no avail, and then started just dropping things out to see what that would do. Even dropping the house terms and reducing the random terms to just the repeated measures doesn't help; doing that results in the Laplace method converging after hours, but without being able to estimate any parameters.

I do notice that SAS says there is 1 subject (Block in V), with a max of 10812 observations per subject. In fact, there are 2108 real_locs with up to 8 observations per real_loc. That's the only new insight I have after a lot of staring, I'm afraid. Have you any suggestions about how I might appropriately accommodate the most important elements of the structure of the data in a way that will run? Or should I at this point be looking at sub-setting the data set and trying to look at it in chunks? I'm afraid that won't help; it's so unbalanced across all levels, because every house has a different configuration of places in it, and we could only sample what was there. I am currently running a model, with method=laplace, that includes only the two environments (down from 7) that are the most balanced and have the greatest numbers of events; it looks more promising in that it has been running for hours, but of course it doesn't really address my actual hypotheses.

thanks for your insight!

susi

SteveDenham
Jade | Level 19

Can you identify "common" real_locs across houses?  Say, "kitchen counter" or "bathroom sink", and maybe reduce the problem that way?

Another approach might be to do some cluster analysis, and see if the real_locs can be clustered in such a way that instead of over 2000, you have a dozen or so strongly associated locations.
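For instance, each real_loc's recovery history could be collapsed to a single rate and then clustered on that, roughly like so (a sketch only; the data set and variable names locrates, locclusters, recovery_rate, and n_samplings are assumptions):

/* one row per real_loc with its observed recovery proportion */
proc means data=mydata noprint nway;
   class real_loc;
   var Recovery;
   output out=locrates mean=recovery_rate n=n_samplings;
run;

/* group the 2000+ locations into a dozen or so clusters */
proc fastclus data=locrates maxclusters=12 out=locclusters;
   var recovery_rate;
run;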

I just looked back at the original model statement. Is Recovery binary (0, 1) or a proportion? Sometimes the binomial doesn't work well on repeated measures if all of the values are 0 or 1; that would be a case for maybe changing to dist=binary. The other approach would be to collapse on some measure to get a proportion, something that can be expressed in the events/trials syntax, but how to approach that doesn't seem obvious (at least to me).

Steve Denham

skr01
Calcite | Level 5

Indeed, recovery is binary, but I have already coded dist=binary, so no luck on that front.

As for common reallocs, I'm not sure I understand. Each house has over 100 places that got sampled (kitchen sink drain, vegetable drawer, computer keyboard...). These categories of places are binned into 7 environments. Each house was visited up to 8 times, thus the repeated measures on these physical locations (reallocs), occurring on the 8 days each house was visited. How would I approach clustering, and could I do it with a binary response?

I did try to look at whether the imbalance arising from locations being present sometimes and not others was driving the problem. I ran a model with only drains and trash cans, since there are lots of them and they occur in all houses at all samplings. I used the G-side approach you recommended and method=laplace. It ran for 17 hours and then stopped due to insufficient memory. Even if it had worked, it would take me pretty far from the hypotheses I am after. Is there a way to come at this from another angle, perhaps?

SteveDenham
Jade | Level 19

So realloc is nested in environment? If so, that could lead to an events/trials syntax, and simplify the model at the same time.

What about:

proc glimmix data=mydata;
   class house year_season environment season;
   model Recovery_sum/Sampled_sum = environment year_season / dist=binomial link=logit ddfm=residual;
   random int environment / subject=house;
   random year_season / subject=environment*house type=AR(1);
   covtest / wald;
   nloptions tech=nrridg maxiter=250;
run;

where Recovery_sum sums Recovery across all real_locs within a house*environment, and Sampled_sum counts the real_locs sampled within a house*environment.
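The collapsed data set might be built roughly like this (a sketch; the name "collapsed" is an assumption, and it takes the sums within each house*environment at each year_season, since the response still varies across samplings):

proc means data=mydata noprint nway;
   class house environment season year_season;
   var Recovery;
   /* Recovery_sum = number of positive real_locs, Sampled_sum = number sampled */
   output out=collapsed sum=Recovery_sum n=Sampled_sum;
run;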

Steve Denham

skr01
Calcite | Level 5

Oh, that is very interesting!!! I think that real_loc is actually nested within environment*house rather than within environment, but I also think that is how you actually coded it in the code you suggested. I am trying it now (actually, I am reconfiguring the data so as to be able to try it...). A few followup questions: returning to my core hypotheses/questions that I want my final model to address, they are:

a&b) Are there differences among types of environments and seasons in the probability of recovering our bacterium?

c&d) Is there significant variation among houses in rates of recovery?, and Does recovery among environments vary across houses?

e) Is the probability of recovering the bacterium at a later sampling correlated with whether it was recovered there in the past? (in other words, not only do I know I need to accommodate the repeated measures in the model, I am actually interested in whether there is a "significant effect of repeated sampling in a location").

If I modify your model to read:

proc glimmix data=mydata;
   class house year_season environment season;
   model Recovery_sum/Sampled_sum = environment season year_season / dist=binomial link=logit ddfm=residual;
   random int environment / subject=house;
   random year_season / subject=environment*house type=AR(1);
   covtest / wald;
   nloptions tech=nrridg maxiter=250;
run;

I believe that a&b are addressed in the model statement, but I remain confused about how to interpret year_season, which logically seems to me like a random factor nested within season, but which I understand needs to be in the model statement if I want it to be a repeated effect further down.

I see that c&d are addressed in the first random statement.

And I believe that e can no longer really be addressed, in that biologically it is very different if I find a bug in the kitchen sink twice in a row vs. if I find it once in the kitchen sink and once in the bathroom sink (those would be two real_locs that would be collapsed into one level of "environment*house"). However, the effects of the repeated measures are at least accommodated so as to make the tests of a-d valid. Is that correct? If so, and the model runs as opposed to not running, I can certainly live with it!

thanks!

SteveDenham
Jade | Level 19

I think you have the issues addressed. Point e) is still addressed by the AR(1) coefficient. It is the correlation between successive samples on the experimental units, which are now environments within houses.

Steve Denham

Message was edited by: Steve Denham

skr01
Calcite | Level 5

The model still doesn't run, but that is clearly because of additional/spin-off issues (which I may come back and ask about soon as a separate post, if I cannot resolve them). With respect to this one: I wish I could mark two responses as correct. The idea of creating a year_season variable to eliminate the different-start-time problem among "samplings" labeled 1-8 clearly addresses my major motivation for beginning this thread, but the idea of thinking of real_loc as nested within environment opened up a different approach to the problem. Neither model will run at this point, but two options to optimize are better than one. What is interesting to me is that both of these insights have to do with thinking about a sort of nesting that I had not considered. I hope to be able to generalize that to different situations in the future.

Thanks for your help!


