turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- General Programming
- /
- proc glimmix, insect counts

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-06-2012 03:51 PM

hi,

I have 4 different treatements, my ultimate goal is to know whether the treatments have an effect on the number of a particular insect-species present (counts) and how the treatements relate to each other.

I have 3 different locations where the experiment is repeated, in every location the 4 treatments are represented. I repeated the measurement on these three locations 4 times, each time interval approximately one month. The variation of the counts will thus be prone to contain random variation due to the local and temporal differences of the abundancies of the insect apart from the variation accounted by the treatment. The insects will be more or less abundant at different dates and at different locations.

Because of this i thought a proc glimmix with date and location as a random effect would be appropriate for the job.

I would like to know if this glimmix approach is just and if so the following model is correct to use for this:

proc glimmix data= wholeseason ;

class loc treatment date;

model count=treatment/dist=poisson link=log solution;

random loc date;

lsmeans treatment/cl ilink diff;

run;

or should i add in the random statement also date*loc? How can i compare the different models for better fit?

thank you,

Bas

data wholeseason;

input loc $ treatment $ date $ count;

cards;

mb g 1 4

md g 1 21

pv g 1 1

mb i 1 1

md i 1 20

pv i 1 1

mb k 1 27

md k 1 23

pv k 1 25

mb t 1 0

md t 1 2

pv t 1 0

mb g 2 16

md g 2 17

pv g 2 2

mb i 2 17

md i 2 9

pv i 2 9

mb k 2 32

md k 2 18

pv k 2 22

mb t 2 39

md t 2 49

pv t 2 28

mb g 3 8

md g 3 3

pv g 3 5

mb i 3 13

md i 3 15

pv i 3 4

mb k 3 20

md k 3 18

pv k 3 7

mb t 3 1

md t 3 10

pv t 3 10

mb g 4 2

md g 4 1

pv g 4 2

mb i 4 2

md i 4 16

pv i 4 9

mb k 4 4

md k 4 14

pv k 4 6

mb t 4 5

md t 4 14

pv t 4 10

run;

Accepted Solutions

Solution

11-27-2012
02:25 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-27-2012 02:25 PM

When things did not work out so quickly, I had to explore why. I believe the Poisson distribution is not appropriate for these data. Remember, for Poisson distributed data, the population mean and variance are equal. For these data, the variances are substantially larger than the means. This implies that the negative binomial distribution would be more appropriate. Additionally, the variances are not constant over treatment. The likelihood surface is also quite flat, so that tuning of convergence methods are in order, as well as moving to the most conservative optimization technigue (conjugate gradient). Fianlly, for repeated measures, especially with small datasets, a Kenward-Rogers adjustment to the denominator degrees of freedom reduces bias. Consequently, I offer the following, with changes in **bold**:

proc glimmix data= wholeseason **initglm pconv=1e-6** method=mspl;**nloptions maxiter=2000 tech=congra**;

class loc treatment date;

model count=treatment date treatment*date/**dist=negbin** link=log solution **ddfm=kr2**;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1) **group=treatment**; /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

lsmeans treatment*date/cl ilink slicediff=date;/* Slicediff gives treatment comparisons at each date */

ods output fitstatistics=fitfull;

run;

/* Fit reduced model using subject based maximum pseudolikelihood method*/

proc glimmix data= wholeseason **initglm pconv=1e-6** method=mspl;**nloptions maxiter=2000 tech=congra;**

class loc treatment date;

model count=treatment date /**dist=negbin** link=log solution **ddfm=kr2**;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1) **group=treatment**; /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

ods output fitstatistics=fitred;

run;

data fitfull1;

set fitfull;

if descr eq '-2 Log Pseudo-Likelihood';

val_full=value;

keep val_full;

run;

data fitred1;

set fitred;

if descr eq '-2 Log Pseudo-Likelihood';

val_red=value;

keep val_red;

run;

proc sql;

create table testit as select * from fitred1, fitfull1;

quit;

data testit;

set testit;

chival=val_red-val_full;

probchi=1-cdf('chisquare',chival,9);

run;

Now, after all of this, I find that the p value for the significance of including the interaction term was 0.05284, while under restricted maximum pseudolikelihood, the p value was 0.0464. Not much different, and certain to lead to a lot of head scratching. Now, I pull out my soapbox and start the speech about how it is probably not a good idea to select models based on p values. Models, especially models as simple as this, should be based on the experimental design, and factors not excluded simply because they are "not significant.". In the end and if this were my data to analyze and explain, I would use the default method of rspl to rank fully specified models by information criteria, keeping in mind that the pseudo-data under the various structures do differ, but appealing to Lindsey (2007) *Applying Generalized Linear Models*, and using the corrected pseudo-AIC (AICC) as a ranking method. That would give rise to the following code:

proc glimmix data= wholeseason **initglm pconv=1e-6** ic=pq;**nloptions maxiter=2000 **;

class loc treatment date;

model count=treatment date treatment*date/**dist=negbin** link=log solution ddfm=kr2;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1) **group=treatment**; /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

lsmeans treatment*date/cl ilink slicediff=date **adjust=simulate(seed=1) adjdfe=row**;/* Slicediff gives treatment comparisons at each date, with multiple adjustment method of Berry and Edwards */

ods output fitstatistics=fitfull;

run;

I realize this is long, but I hope it is helpful. It has been for me.

Steve Denham

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-07-2012 09:01 AM

I would think about this just a little differently, especially about the date. I would worry about autocorrelation, and treat this as a repeated measures analysis. The code would look like:

proc glimmix data= wholeseason ;

class loc treatment date;

model count=treatment date treatment*date/dist=poisson link=log solution;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1); /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

lsmeans treatment*date/cl ilink slicediff=date;/* Slicediff gives treatment comparisons at each date */

run;

Steve Denham

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-16-2012 08:18 AM

thank you,

cold you tell me why the date is also present in the fixed effects? the random date/residual subject=loc- statement takes the date into account already? Or is that merely the effect of differences between the dates on the same location and then the date in the fixed effects the effect of the difference of the dates on the whole group-mean of the locations on that date, the effect of the dates on the overall count?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-19-2012 08:03 AM

I think I understand what you are saying, but here is my reasoning. There is a definite (fixed) effect over time at each location--it is why the random statement includes the residual option. See the GLIMMIX documentation regarding R-side variation.

Steve Denham

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2012 08:21 PM

Thanks, reading the documentation was clearifying about that matter.

Now i have a model that makes sense, but how do i compare this model with alternative ones?

I cannot use the information criteria, nor can i asses any likelihood-ratio statistic if i have random components incorporated in the proc glimmix because of the pseudo-likelihoods which are generated and which are not to be compared between models as it is other procedures.

I searched the SAS communities for helpfull answers on other peoples questions regarding this matter, but there does not seem to be an answer...

exept for using the quadrature method or the laplace method by which i then would obtain usable information criteria and likelihoods, but because i have this R-side random effect, it is not possible to use these methods.

In this perticular case, i do not see which alternate models i even should compare except for removing the fixed effects based on their p-value and then checking by using the likelihood ratio test (which i cannot do in this procedure)? Because i know there is a random effect on the location and i know there is going to be the repeated measurements structure, i would only compare different types of structure for the R-side (compound symmetry, univariate, toep,ar(1)...and i could check if my assumption of a poisson distribution is correct.

i find these matters very confusing...

Bas

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-21-2012 09:00 AM

I agree. It is confusing. A two stage plan might help. First, using the most exhaustive model, choose an appropriate covariance error structure. Information criteria can be used as a guide here, as the model does not change. Then, and only then, fit the nested models using maximum likelihood (not restricted maximum likelihood) to look for substantial reduction in the log likelihood. The differences, under a single error structure, including random effects, should be asymptotically distributed as chi squared, with df=number of constraints applied (here the constraint is that the coefficients for a variable are set to zero).

There are a lot of weasel-words in here: "guide", "asymptotically distributed", etc. that mean the theory says "You cannot..", but practice says, "Well, you can if you are very careful about interpreting results..." Make lots of graphical output--if the picture does not fit the data, then obviously the model is incorrect/incomplete.

Steve Denham

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-26-2012 07:43 PM

I tried to comprehend this, but i failed...

'Information criteria can be used as a guide here, as the model doesn't change', so i can use the pseudo AIC's (the only ones i can summon here) for choosing the best fitting covariance structure as long as i keep my model statement fixed (count=treatment date treatment*date dist=poisson link=log)?

After i have chosen the right structure I should be comparing the nested models: count= treatment date treatment*date/count=treatment date/count=treatment...

by using the maximum likelihoods...but there resides my next problem now...you state it very easy: compare the nested models using maximum likelihood (not restricted). How do i do this? I only have a -2 pseudo restricted log likelihood in my output. If i could work with this i would (in lack of finding the statement that is going too compute a likelihood ratio test for me in the proc glimmix) compute the difference between full and null model manually (so actually (2pseudo restricted log likelihood null model -2pseudo restricted log likelihood full model) , then check if this value is significant in a chi-square vs p-value table with df=1 (as i want to compare the model with interaction against model without interaction so 1 constraint)...when significant, i could then reject te null model in favor of the full model...

I am also worried about the fact that i will probably only find that only the full model will be to be accepted...

but i have this creeping feeling this is not the way to go...but what is then? 3 courses of statistics during my biology-training..but now for my final apoptheosis of the training i feel such a failure in comprehending these more advanced manners of statistics...so i am sorry for the many questions that keep coming up, very greatfull for the answers you've allready given me...

Bas

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-27-2012 11:24 AM

*'Information criteria can be used as a guide here, as the model doesn't change', so i can use the pseudo AIC's (the only ones i can summon here) for choosing the best fitting covariance structure as long as i keep my model statement fixed (count=treatment date treatment*date dist=poisson link=log)?*

Correct.

As far as testing the models, remember that ODS output is your friend--it is an easy way to get info into a dataset. In the full model, change to the following:

/* Fit full model using subject based maximum pseudolikelihood method*/

proc glimmix data= wholeseason method=mspl;

class loc treatment date;

model count=treatment date treatment*date/dist=poisson link=log solution;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1); /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

lsmeans treatment*date/cl ilink slicediff=date;/* Slicediff gives treatment comparisons at each date */

ods output fitstatistics=fitfull;

run;

/* Fit reduced model using subject based maximum pseudolikelihood method*/

proc glimmix data= wholeseason method=mspl;

class loc treatment date;

model count=treatment date /dist=poisson link=log solution;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1); /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

ods output fitstatistics=fitred;

run;

/* Put data sets together to get a chi-squared test*/

data fitfull1;

set fitfull;

if descr eq '-2 Log Pseudo-Likelihood';

val_full=value;

keep val_full;

run;

data fitred1;

set fitred;

if descr eq '-2 Log Pseudo-Likelihood';

val_red=value;

keep val_red;

run;

proc sql;

create table testit as select * from fitred1, fitfull1;

quit;

/* Calculate chi-squared probability */

data testit;

set testit;

chival=val_red-val_full;

probchi=1-cdf('chisquare',chival**,<INSERT THE DEGREES OF FREEDOM ASSOCIATED WITH THE INTERACTION HERE);**

run;

Note that in the calculation of probchi, it uses the degrees of freedom associated with the interaction, because there is one constraint for each degree of freedom. In this case, it should be nine (treatment = 3df * time = 3df). Also, be very aware that the test associated with the Type 3 F test for the interaction in the full model is NOT going to result in the same probability as that associated with this test (waves hands in air and says stuff about maximum likelihood vs. restricted maximum likelihood, pseudo-likelihoods vs. true likelihoods, and containment tests vs. orthogonal tests).

I hope this helps.

Steve Denham

Message was edited by: Steve Denham

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-27-2012 11:40 AM

Just ran this on your data. I had to set the type= to compound symmetry (CS) in the random residual statement to get decent starting values for the algorithm. That happens sometimes with generalized linear models. If compound symmetry is not a valid assumption for the error structure, this is really going to get messy.

Steve Denham

Solution

11-27-2012
02:25 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-27-2012 02:25 PM

When things did not work out so quickly, I had to explore why. I believe the Poisson distribution is not appropriate for these data. Remember, for Poisson distributed data, the population mean and variance are equal. For these data, the variances are substantially larger than the means. This implies that the negative binomial distribution would be more appropriate. Additionally, the variances are not constant over treatment. The likelihood surface is also quite flat, so that tuning of convergence methods are in order, as well as moving to the most conservative optimization technigue (conjugate gradient). Fianlly, for repeated measures, especially with small datasets, a Kenward-Rogers adjustment to the denominator degrees of freedom reduces bias. Consequently, I offer the following, with changes in **bold**:

proc glimmix data= wholeseason **initglm pconv=1e-6** method=mspl;**nloptions maxiter=2000 tech=congra**;

class loc treatment date;

model count=treatment date treatment*date/**dist=negbin** link=log solution **ddfm=kr2**;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1) **group=treatment**; /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

lsmeans treatment*date/cl ilink slicediff=date;/* Slicediff gives treatment comparisons at each date */

ods output fitstatistics=fitfull;

run;

/* Fit reduced model using subject based maximum pseudolikelihood method*/

proc glimmix data= wholeseason **initglm pconv=1e-6** method=mspl;**nloptions maxiter=2000 tech=congra;**

class loc treatment date;

model count=treatment date /**dist=negbin** link=log solution **ddfm=kr2**;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1) **group=treatment**; /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

ods output fitstatistics=fitred;

run;

data fitfull1;

set fitfull;

if descr eq '-2 Log Pseudo-Likelihood';

val_full=value;

keep val_full;

run;

data fitred1;

set fitred;

if descr eq '-2 Log Pseudo-Likelihood';

val_red=value;

keep val_red;

run;

proc sql;

create table testit as select * from fitred1, fitfull1;

quit;

data testit;

set testit;

chival=val_red-val_full;

probchi=1-cdf('chisquare',chival,9);

run;

Now, after all of this, I find that the p value for the significance of including the interaction term was 0.05284, while under restricted maximum pseudolikelihood, the p value was 0.0464. Not much different, and certain to lead to a lot of head scratching. Now, I pull out my soapbox and start the speech about how it is probably not a good idea to select models based on p values. Models, especially models as simple as this, should be based on the experimental design, and factors not excluded simply because they are "not significant.". In the end and if this were my data to analyze and explain, I would use the default method of rspl to rank fully specified models by information criteria, keeping in mind that the pseudo-data under the various structures do differ, but appealing to Lindsey (2007) *Applying Generalized Linear Models*, and using the corrected pseudo-AIC (AICC) as a ranking method. That would give rise to the following code:

proc glimmix data= wholeseason **initglm pconv=1e-6** ic=pq;**nloptions maxiter=2000 **;

class loc treatment date;

model count=treatment date treatment*date/**dist=negbin** link=log solution ddfm=kr2;

random intercept/subject=loc;

random date/residual subject=loc type=ar(1) **group=treatment**; /* Repeated measures analysis*/

lsmeans treatment/cl ilink diff;

lsmeans treatment*date/cl ilink slicediff=date **adjust=simulate(seed=1) adjdfe=row**;/* Slicediff gives treatment comparisons at each date, with multiple adjustment method of Berry and Edwards */

ods output fitstatistics=fitfull;

run;

I realize this is long, but I hope it is helpful. It has been for me.

Steve Denham

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

12-03-2012 02:38 PM

That sure was a whole heap of information, but that is very good. The more information i gain, the more i shall know.

I will use the latter you provided, ranking by using the pseudo-AICc value.

thank you very much!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

12-17-2012 05:51 AM

Is it justified to do a similar analysis on 2 locations with the four treatments and 1 location containing only two of the treatments?

or is that to little? Steve said in another post that the incomplete design should not matter when using the random location, that it estimates the marginal means, but i want to know if it is just to do so in the case of one location less where the four treatments are present...i am struggling with how to justify such small sample size...

also, when things do not converge (also not after adjusting the pconv, gconv to rediculously high values), but they do when i drop the repeated measurements, is it then just to say...ooow, this has no repeated measurements pattern in it, or should i say that there is no 'statistical' answer to the problem?

and what to do if there is only the reduced model that coverges? What if the treatment*loc doe not find any model, even after adjusting it multiple times, all little subtle differences...

should i accept this model then?

I should have picked a literature study instead of this madness...every time that i think yes i can go forth with this there pops up another problem...

thank you,

bas

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

12-17-2012 08:16 AM

I am going to answer the convergence thing first--does the model converge if you remove the group=treatment option? I put that in for the case that the pattern was not really similar across treatments. It is stopping the convergence because there is insufficient data to estimate all of the parameters.

Back to the small sample size then. This is when you need to apply appropriate denominator degrees of freedom adjustment, and in particular, the Kenward-Rogers adjustment. If you are using SAS/STAT12.1, use ddfm=kr2; in earlier versions, use ddfm=kr(firstorder). There is a lot of good work in the literature with sample sizes akin to what you have.

Steve Denham

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

12-17-2012 01:06 PM

I believe i tried all hose options...

could i again add my counts up, then run analysis? reminding that i then have only two replicates for the k-group and the g-group?

i really don't get it how this can work, even with the ddfm-correction...

i cannot find any literature on these low sample sizes,i can find it for small samples, but not this small...could you point them out?

I am so sorry for my many questions

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

12-17-2012 01:16 PM

Try this:

proc glimmix data=solitarybeesall cholesky initglm;

nloptions tech=newrap;

class treatment loc date;

model count=treatment|date/dist=negbin s ddfm=kr2;

random intercept/subject=loc s;

random date/type=chol subject=loc s;

lsmeans treatment*date/ilink slicediff=date;

run;

In this approach, I treated date as a G side effect, rather than as an R side effect. The interesting point is that there are dates with zero, or essentially zero, variance estimates (date = 2, 3, or 4). Consequently, when date is treated as an R side effect, the algorithm does not find a feasible starting set of parameter estimates. This specification does model the variance on a subject basis, and so I would consider it a very close approximation to a repeated measures analysis.

Steve Denham