Hi, I need help with the SAS code for running Logistic Regression reporting Robust Standard Errors.
Here is my situation -
Data structure - 100 records, each for a different person. And these 100 individuals are in 20 separate clusters; and there is dependency within the clusters, and the dependency structure is very flexible.
Cluster ID: CID
Dependent variable - EVENT (1, 0)
Independent variables - V1 (numeric values), V2 (categories: 1, 2, 3, 4, 5), V3 (dummy variable: 1, 0). And all these variables are personal characteristics, and there is no cluster characteristic at all.
So, what would be the right setup in SAS PROC to run this model in order to get the robust S.E. to account for the dependency within clusters?
I searched the web, and it appears that proc surveylogistic or proc genmod may be the solution, but I am unable to come up with detailed code that covers every aspect of my model. So, I'd like to get help from the experts here.
I would need a lot more information to give specific advice on an appropriate model. However, your desire for ROBUST SEs is unclear. The rest of your message suggests that you may need to fit a generalized linear mixed model to your data, with the binomial conditional distribution and probably the logit link. The standard errors are not labeled "robust" for this type of analysis. CID would be a random effect (the way I understand your message, but I could be missing important information). If EVENT is really a 0/1 variable (not a number out of n), I would try the following (for a model with no interactions):
proc glimmix;
class CID V2;
model EVENT = V1 V2 V3 / dist=binomial link=logit;
random intercept / subject=CID; /* CID as a random effect, as described above */
lsmeans V2 / cl ilink;
run;
Results are on the logit scale. Note: you would really need to do some reading about this procedure, and SAS for Mixed Models, 2nd Ed. (2006), is a good place to start. So is example 1 in the GLIMMIX User Guide.
What code have you come up with, and what aspects of the design are not covered by the code which you currently have at hand? Your request is rather vague.
I would note that the GLIMMIX procedure might be another option for you to consider. The GLIMMIX procedure allows greater flexibility in specifying the within-cluster dependency structure than the GENMOD or SURVEYLOGISTIC procedures. The GLIMMIX procedure also supports computation of robust (sandwich) variance estimates through the EMPIRICAL option on the MODEL statement.
The EMPIRICAL option is a good way in GLIMMIX to get so-called robust estimates. Note that this is put in the GLIMMIX statement, not the MODEL statement. Further note: as I put in my first reply, there is a great deal of uncertainty about the appropriate model for your data. I gave one possibility, where CID was a subject (e.g., a clinic, a location for a field trial, etc.), and one models dependency within the CID levels as a variance-component model. But there are many other possibilities, and these can be handled with REPEATED statements in GENMOD, as indicated in other replies, or with a
RANDOM _residual_ / sub= .... type = ...;
statement in GLIMMIX. It is in the latter case that the EMPIRICAL option comes into play in a major way. But as I also suggested in my previous reply, there are many modeling options, which would require a fairly detailed discussion to figure out the best choices.
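To make the two alternatives mentioned above a bit more concrete, here is a minimal sketch, not a recommendation for this particular data set. It assumes the data set is called MYDATA (a hypothetical name, not given in the thread), that CID is the subject for the within-cluster dependency, and that a compound-symmetry / exchangeable working correlation is a reasonable starting point. The TYPE= choices are illustrative only and should be replaced by whatever structure suits the design.

```sas
/* Sketch only: MYDATA is a hypothetical data set name.            */
/* Variable names (CID, EVENT, V1, V2, V3) are those in the thread. */

/* GEE-type model in GLIMMIX: R-side (residual) covariance plus     */
/* sandwich standard errors. EMPIRICAL goes on the PROC statement.  */
proc glimmix data=mydata empirical;
   class CID V2;
   model EVENT(event='1') = V1 V2 V3 / dist=binary link=logit solution;
   random _residual_ / subject=CID type=cs;  /* type=cs is an assumed choice */
run;

/* Comparable GEE fit in GENMOD: the REPEATED statement makes the   */
/* empirical (robust) standard errors the default.                  */
proc genmod data=mydata descending;          /* DESCENDING models EVENT=1 */
   class CID V2;
   model EVENT = V1 V2 V3 / dist=bin link=logit;
   repeated subject=CID / type=exch;         /* type=exch is an assumed choice */
run;
```

Both steps treat V3 as a numeric 0/1 covariate, as in the earlier code. With only 20 clusters, the sandwich standard errors from either procedure may be somewhat too small, so the results deserve a cautious reading.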
Right! The EMPIRICAL option is specified on the GLIMMIX invocation statement. And you are also correct that the problem is not well specified. The original poster wants specific code, but has not provided a complete specification of the problem. He/she has only indicated an awareness of some issues having to do with correlated responses, but has not indicated enough about how the data were collected to really advise on appropriate code.
If you use PROC GENMOD with the REPEATED statement, a robust variance estimator is used by default. Generalized Estimating Equations (GEE) estimation is used to fit the model. Note that 20 clusters might be a little small for this method. For a logistic model, use the DIST=BIN option in the MODEL statement. In the REPEATED statement, start with simple structures if you can such as TYPE=IND or TYPE=EXCH. The most general correlation structure, TYPE=UN, is the most difficult to estimate and can result in fitting problems. See the examples in the GENMOD documentation: