Re: Choosing multivariate multiple regression in GLIMMIX

mrlang02 · Posted 10-13-2014 10:54 PM

Thank you to all who read my previous pos, but I unfortunately did not get a response. After rethinking my problem, I decided that Canonical Correspondence Analysis or Multivariate Multiple Regression might fit the bill. I am first going to try using Multivariate Multiple Regression, as I can (presumably) do this in GLIMMIX, where I would be able to include random effects (research site is a random effect in my design), and specify non-normal distributions (Poisson, as I am using count data).

Briefly, my research question involves modeling how the density of 3 classes of wetland plants (my response variables) varies with plot hydrology (hydro) and leaf area index (LAI). I want to be able to state results to answer these subquestions:

-Does LAI, hydrology, and the interaction between these two factors affect the density of wetland plants?

-Do the different classes of plants respond differently to these factors and if so, in what direction?

My 2 questions at this point are

1) Can I model this question in GLIMMIX, using multivariate multiple regression (I believe I have to use the BYOBS option) ?

2) Will I be able to draw the conclusions I aim to, as stated above, by using this method?

If someone can answer this question, I have more questions about the coding, but I figure I need to start with basics right now!

Additionally, if you have other suggestions about how to model the data, please contribute!

Meghan

MORE INFO ABOUT MY DATA:

I have 3 research sites

Within each research site, I have multiple plots, each with counts of each class of wetland plant, and measures of plot hydrology and LAI.

SteveDenham · Posted 10-14-2014 07:43 AM

I believe you are heading down the right road, but before I commit, one question? What are the dependent variables. From this description, I see only one dependent variable--count per plot. If you have multiple species, I would fit that as a fixed effect, with possible different variances (GROUP= option). Can you be a bit more specific on the design (number of species, number of observations, etc.)?

Steve Denham

mrlang02 · Posted 10-14-2014 10:11 PM

Steve, thank you so much for your input. You are correct that my only dependent variable is count per plot. But I have counts for each species group per sample unit, so I am unsure if each should be viewed as a separate variable or not.

My first instinct was to fit "species group" as a fixed effect, as you mentioned, but wasn't sure of the validity of this approach. It seemed odd to include this as a fixed effect, because it is a classification of the dependent variable, and not a treatment or other independent variable. But here is the code for my initial attempt at the problem, where I DID try to model species group as a fixed effect (IND in the model statement). More info about my design is at this post

*"Site" identifies research site,"IND" is species group, "hydro" is a continuous variable, a measure of plot wetness, and "LAI" is a continuous variable, a measure of leaf area index.

proc glimmix data=shrub maxopt=100;

class site ind;

model count= hydro ind ind*hydro / dist=poisson link=log solution;

random site;

covtest 'Global Test random effects' ZEROG / CL WALD;

output out=glimmixout predicted=pred Pearson=PearsonRes; run;

So, are you saying it is valid to model IND as a fixed effect, even though it is really just a category of the dependent variable? (e.g. all shrubs were counted, and tallied according to species groups. IND identifies those species groups. IF SO, is my model statement stated correctly? I am wanting to know if the regression slopes of each species response (count) to hydrology differ from one another. In other words, do species respond the same, or differently to increasing wetness?

Thank you again.

Meghan

SteveDenham · Posted 10-15-2014 07:44 AM

IND is certainly a fixed effect to my way of thinking. I don't see it as a classification of the dependent variable at all.

It appears that you went to a site, found a species, and counted the number of plants, found the next species, counted, etc. Hydro was measured for the entire site, while leaf area index is probably a composite(?) across the plants at a site. Am I getting this right? If so, then try:

proc glimmix data=shrub maxopt=100;

class site ind;

model count= hydro lai ind ind*hydro ind*lai ind*hydro*lai / dist=poisson link=log solution;

random site;

The solution vector is going to have a lot of zeroes, I suspect, so slopes will have to be calculated from the estimates OR you could try a "means model" approach where the model statement is something like:

model count=ind*hydro*lai/noint dist=poisson solution;

One other consideration--ecological counts are almost always overdispersed, so you may want to consider a negative binomial distribution.

Steve Denham

SteveDenham · Posted 10-15-2014 08:04 AM

Just saw on the other thread that you have a lot of zeroes. Zero inflation is something that isn't handled well by GLIMMIX in its current version. When you say "a lot", what proportion of zeroes do you have? You may have to treat site as a fixed effect, and model this in PROC GENMOD to get reasonable values.

Steve Denham

mrlang02 · Posted 10-15-2014 11:03 AM

Thanks Steve, you anticipated a lot of my issues. I have had a lot of trouble using my data sets in PROC GLIMMIX because of excess zeroes and even using the negative binomial doesn't help with many of the tests. My next thoughts were to go to a zero-inflated model using NLMIXED, but it looks kind of complicated, so I've been trying to avoid that. But maybe I will try genmod. My only concern there is the number of random effects--for the purposes of asking the question in the forum, I simplified the random effects to SITE only. My sampling design didn't seem complicated at the time, but when I got to the stats, I realized it was. There are three different spatial scales SITE, SIDE, TRANSECT1, TRANSECT 2, with each "nesting" inside the other. TRANSECT 2 is the unit in which I actually sampled the plants.

But I think this gives me a better idea of my options and confirms the issues I've been having getting the models to run.

And in response to your comment "IND is certainly a fixed effect to my way of thinking. I don't see it as a classification of the dependent variable at all.". I just want to make sure you understand what how the IND (indicator status of each plant) variable relates to my data. Let me clarify.

I counted all stems of all species in each sampling unit, and each sampling unit also has a measure of Leaf Area Index (LAI), and the hydro variable.

Then, because it made more ecological sense, I aggregated stem counts of species with the same indicator status (IND). So I have stem counts for each indicator group in each sampling unit, that were calculated simply by adding together counts of species with the same indicator status in each sampling unit. Indicator status is a property of the species, not the sampling unit.

For example if I had this data for a sampling unit

SPECIES STEM_COUNT IND

A 1 FAC

B 2 FAC

C 1 FACW

D 1 FACW

Then I ended up with this summary data set, where the identify of individual species is omitted.

IND STEM_COUNT

FAC 3

FACW 2

And then, for each sampling unit I have data about hydrology and LAI. I want to know if species of different indicator statuses respond significanty and differently to hydrology.

Thank you for all your help!

SteveDenham · Posted 10-15-2014 02:43 PM

So IND isn't really a species indicator. Is it a "stage" indicator that can easily be seen to be similar from species to species? I imagine something like (and pardon my ignorance of the proper botanical nomenclature) "pre-bloom", "bloom", "early fruit", "late fruit"--although maybe not so many categories.

Now on to design. It looks like SIDE, TRANSECT1, TRANSECT2 are all nested levels within SITE. Since they really are random samples of something, why not allow GLIMMIX to sum the counts per site? That's easily done by not specifying SIDE, TRANSECT1 and TRANSECT2 as effects--and it might overcome some of the zero inflation. I would guess that you really don't have enough data to get good variance component estimates for a four-level hierarchical model. And to really look at the effect of the two covariates, you would need to look at random slope models that accommodate all levels, plus the interaction of the fixed and random effects, fit as random effects. I would really try to stay with the simpler, more aggregate model you have.

And since all of the variability is being model as G side variance component, I would consider method=laplace in the PROC GLIMMIX statement.

Steve Denham