BookmarkSubscribeRSS Feed
Kai1989
Fluorite | Level 6

Hi, Dear All,

 

I think I have really poor background in statistic, and now is struggling on my data analysis with an animal preference study.

Here is the study description:

We have two sets of light preference testing pens (i.e, A and B), for each testing pen, we have two cages with a door/window that allow birds to move freely between each other (as shown in the image).  Groups of birds (3 birds) were treated as the experimental unit, and totally 12 groups of birds were tested (two groups each time, one in A, and the other one in B, repeated 6 times).

The treatment is the light. We have one cage used LED light, and the other cage use CFL light so birds can choose the light environment they prefer. The lights were swaped at night during the four-day tests as it may have cage effects.

Light arrangements were as followings

             cage 1   cage 2  

day 1     LED        CFL

day 2     CFL        LED

day 3     LED        CFL

day 4     CFL        LED

The variable is the time spent of birds (hr/bird-day) under each light/cage

Top-view.jpg

A prelim SAS code I write as follows:

Proc Mixed;

class trial pen light cage day ;

model timespent = light cage light*cage; 

random trial trial*pen;

repeated day /subject=trial*pen type=AR(1);

I really have no idea about the code actually.

 

could anyone tell me how to analyze this type of experiment? thanks

 

 

 

15 REPLIES 15
sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Perhaps the best experimental design advice ever came from Ronald Crosier, who said on sci.stat.edu at least two decades ago: No one should ever do an experiment without analyzing it first. (See http://mathforum.org/kb/message.jspa?messageID=1515598 for his elaboration on what he meant).

 

There surely are many ways to approach the analysis of this study. I will put forth one that is relatively simple but does make assumptions about sources of variance that may or may not be appropriate, and it involves statistical design and analysis concepts with which you likely are not familiar (for examle, crossed random effects, non-normal distributions).

 

I assume that the total amount of time for each trial is the same; consequently, the sum of the amount of time in the LED cage and the amount of time in the CFL cage is the equal to the total time. These two metrics are not independent: if you told me how much time the group spent in the LED cage, I could tell you exactly how much time it spent in the CFL cage. Consequently, you can use time spent in, say, the LED cage as the response and completely ignore the time spent in the CFL cage. (You could also use the proportion of total time spent in the LED cage; however proportions can have challenging distributional properties--the beta distribution in a generalized linear mixed model might be appropriate for a proportion response--and you might have better success with timeLED, perhaps with a transformation. The choice depends on the actual data values, as well as your preference for response.)

 

I would think of pen and day as crossed random-effects factors; this assumes that pen and/or day contribute to the variance of the response but have no effect on the mean response--in other words, neither pen nor day are fixed effects factors--which seems plausible to me (but may not be a correct assumption). The intersection of pen (with 2 levels) and day (with 6 levels) is a unique group of 3 birds (I'm assuming that you have 12 different groups of birds), and the experimental unit defined by this intersection==group_of_birds (which may also be ==trial) is the experimental unit for the fixed-effects factor, light. These assumptions lead us to 

 

class pen day light;

model timeLED = light;

random pen day;

 

Issues that might arise: (1) timeLED may not be normally distributed with homogeneous variances for both light levels; (2) variance among pens and/or among days may have estimation problems; I would expect both variances to be small (but, again, that might not be a correct assumption).

 

An excellent resource is https://www.sas.com/store/books/categories/usage-and-reference/sas-for-mixed-models-second-edition/p.... Time spent with this text is not time wasted. 

 

I hope this helps get you going.

 

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Oops, my apologies. A sudden realization that I have the wrong vision here, and rather than edit my first response, I'll add another.

 

The amount of time spent under LED light is not independent of the amount of time spent under CFL light, and the response can be defined by only one or the other. But then the light factor is no longer needed, because we do not have response values under the two different light levels.

 

So, I would try using the proportion of total time spent under LED light as the response:

 

 

proc glimmix data=have method=laplace;
    class pen day;
    model proportionLED = / dist=beta intercept s;
    random pen day;
    estimate "Intercept" intercept 1 / ilink cl;
    run;

 

and I would test the hypothesis that proportionLED = 0.5 (in other words, whether half of the time is spent under LED light and thus half under CFL and thus no preference), which would be accomplished by testing whether the intercept = zero (because the beta distribution uses a logit link, and logit(0.5)=0). Placing a confidence interval on the intercept would be nicely informative.

Kai1989
Fluorite | Level 6

Thanks very much for your reply and explanation

 

I applied your code in my data analysis

proc glimmix data=preference method=laplace plots=all;
	where light="LED";
	class birdgroup day;
	model timeproLED=/ dist=beta link=logit intercept s;
	random day*birdgroup ;
	estimate "Intercept" intercept 1 / ilink cl;
run;

With this SAS code, I am able to test if the timeproLED (proportion of time spent under LED light) equals to 0.5. The code gives a estimated Mean value, and 95% ci values. However, with this method, the "Mean" does not match the mean calucalated by Proc Means procedure. 

 

As you told, for each day, time spent in LED and time spent in CFL were not independent responses (correlated), what do you think about the paired t-test for this study? To do test, I tried to average all values over the four testing days for each birdgroup (12 groups in total), then run code as followings? 

proc ttest data=preference;
paired LEDdur*CFLdur;
run;

Thanks very much

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

You're welcome.

 

First comment: Note that your RANDOM statement is not what I recommended. The different RANDOM statements generate different models with different assumptions about the role of variables in the design.

 

Second comment: The statement

estimate "Intercept" intercept 1 / ilink cl;

will provide an "Estimate" and "Standard Error" which are on the link scale; here, the link is logit. The columns labeled "Mean" and "Standard Error Mean" (generated by the ilink option; ilink means inverse link) are estimates on the proportion scale (i.e., the proportion of time spent under LED light), something (but not exactly) like back-transformations of "Estimate" and "Standard Error", respectively. "Estimate" will not look anything like a mean proportion computed from the raw data. "Mean" will be close, but it will not be the same as the mean computed from the raw data for various reasons that are too complicated to describe here--but are covered in Stroup (2013) Generalized Linear Mixed Models, CRC Press.

 

The paired t-test is wrong. The proportion of time under LED is not just (negatively) correlated with the proportion of time under CFL; these two metrics are totally redundant--there is no information in one metric that is not already in the other. If you told me that in one trial, the proportion of time under LED was 0.62 (62%), I could tell you that the proportion of time under CFL in that trial was 0.38, because these two values must sum to 1 (100%). Applying a t-test to these "paired" data is not valid. Don't do it.

Kai1989
Fluorite | Level 6

My dataset looks like the following tables, we have two identical testing chamber (pair), named A and B, each testing camber has two cages. we have 12 groups of birds. with this dataset, could you help to classify the Random statement?

Still, I am not quite understand about your statment:

The proportion of time under LED is not just (negatively) correlated with the proportion 
of time under CFL; these two metrics are totally redundant--there is no information in one
metric that is not already in the other. If you told me that in one trial, the proportion
of time under LED was 0.62 (62%), I could tell you that the proportion of time under CFL
in that trial was 0.38, because these two values must sum to 1 (100%).

althtough the sum is fixed, but why can't we compared this two values directly?

 

 

proc glimmix data=have method=laplace;
where light="LED"; class pair group day day cage; model lightpre = / dist=beta intercept s; random pen day; estimate "Intercept" intercept 1 / ilink cl; run;

 

flock pair group day cage light lightdur lightpre

1 A 1 1 1 LED 4.07 40.74

1 A 1 1 2 CFL 5.93 59.26

1 A 1 2 1 CFL 5.85 58.46

1 A 1 2 2 LED 4.15 41.54

1 A 1 3 1 LED 4.37 43.71

1 A 1 3 2 CFL 5.63 56.29

1 A 1 4 1 CFL 5.05 50.46

1 A 1 4 2 LED 4.95 49.54

1 B 2 1 1 LED 2.9 29

1 B 2 1 2 CFL 7.1 71

1 B 2 2 1 CFL 5.57 55.71

1 B 2 2 2 LED 4.43 44.29

1 B 2 3 1 LED 3.58 35.75

1 B 2 3 2 CFL 6.42 64.25

1 B 2 4 1 CFL 5.08 50.79

1 B 2 4 2 LED 4.92 49.21

 

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

 

I do not understand the structure of your dataset and how it reflects the design of your experiment you described in your initial question.

What is lightdur?

What is lightpre?

You say you have 12 groups of birds; how are these 12 identified in your dataset?

 

Did you repeat the trial 4 times (for example, once each day for four days) on the each group of birds?

 

I suspect the values of lightdur and lightpre are means over groups of birds. If so, then you cannot do statistical analysis on this dataset. For statistical analysis, you need data for independent replications (which I think would be groups of birds).

 

One format for a dataset that I would expect has 48 observations, one for each day for each group of birds. Variables would be (1) group identification (with 12 unique values); (2) pair (A or B); (3) day (1, 2, 3, or 4); (4) amount of time spent under LED; and (5) amount of time spent under CFL. Subsequently in a datastep, you could compute the proportion of the total time (amount under LED + amount under CFL) that a group spent under LED; this proportion is the response variable that you would use in the model I proposed.

 

Cage (the two areas within each pair) could be incorporated into the model, but it would make for a more complex statistical model, and we are already challenged by this simpler model. I would not include it at this stage. I might consider including it later if there appears to be a day effect of some sort.

 

If you want to assess preference for one light or the other, then the proportion of time spent under one light is a good metric. If you agree with that, then think about how the CFL proportion is just another way to measure the LED proportion: 

 

CFL proportion = 1 - LED proportion

 

Analyze one or the other, but you do not need both because the two metrics are just different ways to measure exactly the same thing.

 

If you want to assess the amount of time spent under one light or the other, well, that could be another analysis and is not the objective you described in your initial question. 

Kai1989
Fluorite | Level 6

Thanks very much,

 

Yes, I will use a same dataset structure as you described "One format for a dataset that I would expect has 48 observations, one for each day for each group of birds. Variables would be (1) group identification (with 12 unique values); (2) pair (A or B); (3) day (1, 2, 3, or 4); (4) amount of time spent under LED; and (5) amount of time spent under CFL. Subsequently in a datastep, you could compute the proportion of the total time (amount under LED + amount under CFL) that a group spent under LED; this proportion is the response variable that you would use in the model I proposed."

 

In terms of your Random statement

Random pen day;

I would thiink it should be

Random group day*group;

repesenting random group effect among 12 groups of birds, and random day effect among 4 days within each group of birds.

Thus we will have DF=11 for the estimates of intercept, am I right?

 

 

In the study, the objective is to determine which light is prefered by the birds, but when we present results, we want to present time spent or proportions of time for both lights, then tell which one is prefered, so we would also like to determine mean values, stderr and 95% ci for both LED and CFL lights. 

 

 

 

 

 

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Let's meet in the middle with the RANDOM statement. Try your restructured dataset with this model 

proc glimmix data=have method=laplace;
    class pair group day;
    model proportionLED = pair day pair*day / dist=beta intercept s;
    random group*pair;
    estimate "Intercept" intercept 1 / ilink cl;
    run;

This RANDOM statement estimates a variance among the 12 groups of birds. 

 

We hope that there is no statistical evidence of a pair effect, in other words, we hope that the two pairs are essentially identical. We really hope that there is no pair*day interaction; a significant interaction might imply a difference between the two cages within a pair and/or an effect of the cage assignment to light bulb type. If there is a day effect, then the pattern of the day means might imply something like "learning" by birds over time.

 

If all of the fixed effects are not significant, should you omit them from the model? That's arguable. Certainly they are "nuisance" factors: they are not components of your research hypotheses. If not significant, then retaining or omitting likely will have little effect on the Intercept estimate. But you never know.

 

This model does not address whether the covariances among the four repeated measures are unequal (it's essentially invoking compound symmetry). The beta distribution does not lend itself to explicitly modeling of the covariance structure of repeated measures; as I understand it, there is a conflict between the estimation of the scale parameter for the beta distribution and the estimation of covariance parameters for repeated measures.

 

This model does not address the assignment of light bulb type to cage within a pair that changes over time--the design is a Latin square, but not a simple one. These design elements might be added, but they would complicate the model appreciably. If there is no statistical support for pair*day, I probably would not try to model the Latin square aspects.

 

Because I've added fixed effects to the model, I would expect 10 denom df for intercept.

 

Let me know what you get!

 

Was each trial for a group of birds the same length of time? In other words, if you add time spent under LED and time spent under CFL, do you get the same value for each trial?

 

Although the model provides an Intercept estimate (and CI, etc.) for just one light type, you can get estimates for the other, either algebraically or by analyzing proportionCFL instead of proportionLED.

Kai1989
Fluorite | Level 6

Thanks very much for your time,

Yes, for each trial and each day, the testing time is 10 hour with lights-on

proc glimmix data=have method=laplace;
    class pair group day;
    model proportionLED = pair day pair*day / dist=beta intercept s;
    random group*pair;
    estimate "Intercept" intercept 1 / ilink cl;
    run;

I tried this new code for the dataset, as you/we expected, all pair, day, and pair*day effects are not significant. The DF=10 for the intercept. I think this code will be suitable to analyze my data now.

 

Indeed, time spent is just one of the response we measured. there is another response value for the preference, i.e., LEDnum (num of birds under LED at night, as birds won't move at night, so we count number of birds in each cage, we have total 3 birds for each group). for time spent or proportion of time, the value is continous data, how about the number of birds (0,1,2, or 3) or the corresponding proportion? Can we use the same model to analyze it?

 

 

What if we have three types of birds tested, i.e., pullet, turkey and broiler. For each type of birds, we run the exactly the same procedure (12 groups of birds, 2 groups of birds tested each time in the two pairs of testing comparments (A, B); each group of birds tested for 4 days), can we test the effects of bird type? For example, if I run the code reparatedly for each type of birds, we find pullets and boriler like LED, but turkey does not. How can we include Birdtype as a fixed factor in the model you have told me?

 

 

 

 

 

 

 

 

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Ah, good.

 

Another thought about the day effect: you are hoping/assuming that birds do not develop an aversion to one cage relative to the other (which is not the same as preferring one light type over another). If you were, for example, shocking birds when they were in the right-side cage, they might sensibly decide not to go there in the future, regardless of whether it had their favorite light bulb. This would be an example of carry-over effect that can be a problem in Latin square designs.

 

Regarding BirdType: Yes, it can be added to the model. Pair and BirdType are treatment factors that are associated with a Group of birds (i.e., a Group is the experimental unit for Pair and BirdType). If each trial is 10 h long, then you may be running only 2 Groups each day, requiring 24 calendar-days to complete trials for 12 Groups run 4 times. (Using "calendar-day" I mean to distinguish between the time that you ran a trial (e.g., February 15) from the "day" that marks the observation within a trial (i.e., day 1, 2, 3, 4.) For 3 different types of Group, the experiment requires 72 calendar-days. That is a fair amount of time passing; depending on the environmental conditions, bird "history", how different BirdTypes were allocated to calendar-days, and such, there could be confounding factors associated with time passing. Give that some thought, discuss the possibility with your colleagues.

 

proc glimmix data=have method=laplace;
    class pair group day birdtype;
    model proportionLED = pair | birdtype | day / dist=beta intercept s;
    random group*pair*birdtype;
    estimate "Intercept" intercept 1 / ilink cl;
lsmeans birdtype / ilink cl; run;
Kai1989
Fluorite | Level 6
proc glimmix data=have method=laplace;
    class pair birdgroup day birdtype;
    model proportionLED = pair | birdtype | day / dist=beta intercept s;
    random birdgroup*pair*birdtype /group=birdtype;
    estimate "Intercept" intercept 1 / ilink cl;
    lsmeans birdtype /pdiff adjust=tukey ilink cl;
    run;

Here is the code I used according to your answers, 

/group=birdtype

I used this option as I want each type of birds had respective standard error, then I got results as followings.

As we had three types of birds included in the model, the DF became 30 for the intercept. 

But for the leaset squares means for each type of birds, DF=30, is that right?

Besides, with birdtype being included, the intercept became the overall preferece of three type of birds, right?

Here P<0.0001 means that birds (overall of three types) don't prefer LED light, what about each individual type of birdsy? 

 

Capture.PNG

 

I really appriciate your consistent assistance. Thanks very much!!!

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

To be clear, I did not recommend heterogeneous variances by Group. You can test whether the heterogeneous variances model fits better using the COVTEST statement, or by doing a likelihood ratio test by hand. I would do that test before deciding to go with a more complex model; the homogeneous variances model might be good enough.

 

The intercept is the mean over pair, day, and birdtype. The p-value tests whether the intercept (on the logit scale) is equal to zero. A zero value on the logit scale is equivalent to a value of 0.5 on the proportion scale. If you look at the ESTIMATE output, you'll see that the point estimate for intercept on the proportion scale ("Mean") is 0.4486, with a 95% CI of (0.4257, 0.4717), which does not include the value 0.50 and so is consistent with the intercept test reported in the Type III Tests of Fixed Effects table. Birds prefer CFL over LED by a relatively small margin (55% to 45%).

 

If you had evidence of a difference among birdtypes, you would use the LSMEANS output, possibly adding pertinent LSMEANS statements for interactions if interactions were significant. For these data, proportionLED is similar for birdtypes: none of the terms that include birdtype is significant. Hence, the intercept is a valid summary.

 

What is the difference between birdtype "CFL laye" and "LED laye"?

Kai1989
Fluorite | Level 6

Thanks, I will test if the heterogeneous variances model fits better than the homogeneous variances model.

 

What is the difference between birdtype "CFL laye" and "LED laye"?

the difference between CFL layer and LED layer is the lighting experience of the birds, i.e., before using these birds in the preference test, we raised them in either LED or CFL light since day-old, we want to see if early lighting experience will influcence the preference of light in the later period.

 

In this case, I understand that the p-value is testing whether the intercept=0, or proportion=0.5. As all three birds prefered CFL light, how can I know the p-value for each type of birds? for example, if I want to know exactly p-value for pullets (proportion of time for pullet as compared to 0.5). I mean, without include birdtype as a factor, by running three separated sas code, I will have p-value of intercept for each type of bird.

 

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Look to the LSMEANS birdtype output. Each line reports the predicted estimates and the test of whether "Estimate" is equal to zero for the specified birdtype.

 

Note that the mean of the "Estimate" column over the 3 birdtypes is equal to the "Estimate" for intercept. The same thing is roughly true for the "Mean" column, but not exactly due to the nonlinear nature of the logit link.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 3262 views
  • 1 like
  • 2 in conversation