Hi, my name is Andy and I'm analyzing a large dataset using the SAS GLIMMIX
procedure. My dataset contains over 20,000 GPS records. I'm trying to
evaluate why certain deer were observed during hunting season, so I've coded
the deer that were observed with a "1" and those not observed with a "0". I
coded the entire hour in which the deer was observed to allow for any hunter
recording errors. My model is shown below:
PROC GLIMMIX DATA=OBS METHOD=LAPLACE;
CLASS ID YEAR EXPOSURE HABITAT_VALUE;
MODEL OBSERVED (EVENT = '1') = EXPOSURE STEPLENGTH HABITAT_VALUE ELEVATION
DIST_NEAREST_ROAD / DIST=BINARY LINK=LOGIT SOLUTION;
RANDOM ID YEAR;
RUN;
I want to see whether the different independent variables influence the
observation of deer throughout the hunting season. My question is: what
assumptions do I need to adhere to with logistic regression? I read
that the data do not need to be normally distributed. I know "steplength"
is extremely right-skewed, with a mean of 48 meters and a max value of 1,400
meters. If normality is not an issue, then I assumed the next step would be to
at least examine the residuals and remove some of those extreme movements. I
added the PLOTS=RESIDUALPANEL option to my model with ODS GRAPHICS and plotted
the residuals. The residuals looked very different from what I'd see in a
PROC MIXED model, and I was unable to interpret the plots to determine whether
I need to remove any outliers. Will I not receive a normal residual plot,
similar to PROC MIXED? If so, how do you interpret residual plots from PROC
GLIMMIX? Thank you very much!
This will give the residuals both using the random effect predictors (conditional) and averaging over the random effects (marginal). I don't know if influence statistics (Cook's D, DFFITS) are available for GLIMMIX.
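A minimal sketch of one way to request both kinds of residuals for the model posted above. The output data set name RESID_OUT and the variable names after each "=" are made up for illustration, and it is worth checking the log to confirm that every request is honored under METHOD=LAPLACE.

```sas
ods graphics on;

proc glimmix data=obs method=laplace
             plots=residualpanel(conditional marginal);
   class id year exposure habitat_value;
   model observed(event='1') = exposure steplength habitat_value elevation
                               dist_nearest_road
         / dist=binary link=logit solution;
   random id year;
   /* RESID_OUT and the names after "=" are hypothetical */
   output out=resid_out
          pred(blup ilink)=p_cond     /* probability using the random-effect predictors */
          pred(noblup ilink)=p_marg   /* probability without the random-effect predictors */
          resid(blup ilink)=r_cond    /* conditional residual on the probability scale */
          resid(noblup ilink)=r_marg; /* marginal residual on the probability scale */
run;
```

Having the residuals in a data set makes it possible to sort or plot them against STEPLENGTH directly instead of relying only on the default panel.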
I have one question about the variable ID--does it refer to an individual deer, and if so are there repeated observations on that animal? Then some spatial modeling might be in order as well, or grouping variances by animal, or, well, a whole bundle of things, but probably not relevant to your question about the plots.
Yes, ID refers to an individual deer. I tried running the model with different covariance structures, such as VC (default), CS, AR(1), and UN. The default covariance structure (VC) gave me the best-fitting model based on AICc. I tried the spatial power covariance structure in MIXED when I was analyzing movement data, but I would receive an error message stating that it stopped because of an infinite likelihood. I determined that the error was due to having multiple lines of data for one individual deer. Unfortunately, I wasn't sure how to overcome this and was told by a statistician to use another covariance structure. Thank you for your help!
Aha! The infinite-likelihood problem caused by multiple lines per subject.
You can fix this by respecifying the subject, so that instead of subject=ID, you use subject=ID*. Something makes each line unique, and it should be included on the CLASS statement. A good guess would be one of the fixed effects, say exposure (just a guess, not sure at all). If that is the situation then subject=ID*EXPOSURE might fix the infinite likelihood. It might get more complex to the point that subject=ID*EXPOSURE*HABITAT_VALUE may be needed.
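As a rough sketch only, this is how the respecified subject might look in a PROC MIXED call with a spatial power structure. The original MIXED code was not posted, so the data set name MOVES, the use of STEPLENGTH as the response, and the coordinate variables X and Y are all assumptions, and EXPOSURE is still just a guess at what makes each line unique. The infinite likelihood commonly appears when two records within the same subject share identical coordinates, so the first step checks for that.

```sas
/* MOVES, X and Y are hypothetical names; substitute your own */

/* List records that share the same coordinates within a subject */
proc sql;
   select id, exposure, x, y, count(*) as n_records
   from moves
   group by id, exposure, x, y
   having count(*) > 1;
quit;

proc mixed data=moves;
   class id exposure;
   model steplength = exposure / solution;   /* hypothetical model */
   /* Each ID*EXPOSURE combination is treated as a separate subject */
   repeated / subject=id*exposure type=sp(pow)(x y);
run;
```

If the query still returns rows, another CLASS variable may need to be crossed into the subject, as described above.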
This still doesn't address the residual plot problem. I keep hoping someone will drop a hint in here.
As indicated elsewhere, you must have multiple observations for each ID (individual) if you have ID as a random effect. A statement such as plots=residualpanel should give you four graphs: a normal quantile plot (residual vs. quantile on a normal scale), residual vs. linear predictor (which is the estimated logit here), a histogram of residuals, and a boxplot. I'd suggest using plots=studentpanel to get the studentized residuals (actually conditional studentized residuals here). It is easier to spot outliers that way.
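A minimal sketch of that request, reusing the model from the original post. The CONDITIONAL and UNPACK suboptions are optional additions here (UNPACK draws each of the four graphs separately), and the log should be checked to confirm the panel is produced under METHOD=LAPLACE.

```sas
ods graphics on;

proc glimmix data=obs method=laplace
             plots=studentpanel(conditional unpack);
   class id year exposure habitat_value;
   model observed(event='1') = exposure steplength habitat_value elevation
                               dist_nearest_road
         / dist=binary link=logit solution;
   random id year;
run;
```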
Since your response is binary (0/1), these diagnostic plots are challenging. You can produce all the standard residual plots, but as David Collett states in Modelling Binary Data, "some of them become difficult to interpret." You can get strange-looking residual plots. The Collett book has an excellent chapter on GLM diagnostics, although he does not deal with random effects (in that chapter).
GLIMMIX does not (yet) have formal influence diagnostics (as found in MIXED).