Handling complex survey designs in Proc Glimmix

AmberDetty · Posted 02-29-2012 04:01 PM

This is the first time I've used the GLIMMIX procedure in SAS.

Here are the basics: we are looking to use both individual student health/demographic data and school-level geographical/size data to predict school performance. There are two data sources. The first is a stratified, clustered random sample of third-graders, which provides the individual-level data. Those data were then merged with school performance indicators from the state department of education.

My question is how to handle the strata and cluster variables in proc glimmix most effectively. Some sites suggest putting both variables in a random statement; others suggest putting the strata variable in the random statement and the cluster variable in the subject= option of that statement. I get this error when doing the latter: "ERROR: Integer overflow on computing amount of memory required." I think this is occurring because of the large number of levels in the random statement (there are 377 levels of the cluster variable and over 100 levels of the strata variable). I'm not sure how to fix this.

To make matters even more complicated, we're interested in running stratified regression models based upon whether the school had a school-based program of interest.

Here's my current code, where strata09_n is the strata variable, buildingirn is the cluster variable, and sealprog is the school-based program of interest:

Option 1:

proc glimmix data=a; *final model;
   class ctytype untx ow2 strata09_n buildingirn;
   model perfindex = lowincome racialmin untx ow2 ctytype enrollment3 / solution;
   weight finalweight;
   random strata09_n buildingirn / solution;
   lsmeans ow2 / ilink cl;
   lsmeans untx / ilink cl;
   where sealprog = 0; *where sealprog = 1;
run;

Option 2:

proc glimmix data=a; *final model;
   class ctytype untx ow2 strata09_n buildingirn;
   model perfindex = lowincome racialmin untx ow2 ctytype enrollment3 / solution;
   weight finalweight;
   random strata09_n / subject=buildingirn solution;
   lsmeans ow2 / ilink cl;
   lsmeans untx / ilink cl;
   where sealprog = 0; *where sealprog = 1;
run;

Any advice/thoughts?

SteveDenham · Posted 03-01-2012 09:35 AM

I probably have the cluster and the strata variables confused, so bear with me on this. You have 377 levels of buildingirn nested within ~100 levels of strata09_n. I assume that there are multiple records for each individual buildingirn (probably individual student responses). If that is incorrect, then ignore everything from here on, and we'll try again after I get it straight.

Given that my assumption is correct, then the best choice would look like:

proc glimmix data=a; *final model;
   class ctytype untx ow2 strata09_n buildingirn;
   model perfindex = lowincome racialmin untx ow2 ctytype enrollment3 / solution;
   weight finalweight;
   random intercept strata09_n / subject=buildingirn solution;
   lsmeans ow2 / ilink cl;
   lsmeans untx / ilink cl;
   where sealprog = 0; *where sealprog = 1;
run;

I can't guarantee that this will get around the out of memory problem, though. So you may have to consolidate some strata (which I am going to guess are school districts, and it is vital to get some info on the variability added there). That may be difficult given your objectives. You may need to look into PROC HPMIXED, which is designed to handle data with many random effects.

But now I am going to ask some questions: First, why PROC GLIMMIX rather than PROC SURVEYREG? If you truly conducted a survey, the estimates from GLIMMIX may be biased, even with the weight statement. Second, I see "ilink" options in the lsmeans statements. Generally, I don't use this unless I have specified a distribution in the model that is other than normal or lognormal (which have identity links so that the ilink estimate is identical to the estimate). Is the response variable what you would consider normal, or would another distribution be preferred? Third, I see "final model" in the comments. I hope that this is the result of looking at the interactions between the continuous variables and the class variables to find a parsimonious analysis of covariance model.

Good luck with this.

Steve Denham

AmberDetty · Posted 03-01-2012 09:49 AM

Thanks, Steve.

Yes, basically we have about 100 county/income strata (not school districts) and 377 schools nested within those. Thanks for the reply; I'll try that code and see how the output looks.

We had been advised to use proc glimmix by our statistical contractor because of the multilevel aspect of these models. However, they were not as helpful in determining the best way to handle the sampling design variables in the proc glimmix statements, as SAS is not their primary statistical software. They also advised us to include the ilink options, but you are correct that we probably do not need them, because our analysis has determined that our distribution is fairly normal. I'm definitely open to a discussion of whether proc surveyreg or proc glimmix is the better option here, especially because I am so much more familiar with proc surveyreg. I will look into proc hpmixed as well.

We have looked fairly extensively at the interactions between the variables and done extensive covariance testing to arrive at this model. That work also led to our decision to run the stratified models, after we saw a significant interaction between our student health variables and the indicator of the school health programs. While this model is not set in stone, it is the best we've come up with so far, which is why I labeled it final (at least for now).
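Given that nesting (schools within county/income strata), one way the random effects could be written is with a separate random intercept at each level, each processed by subject. This is only a sketch under that assumption and is not code from the thread; the variable names are taken from the posted code, and the weight statement is carried over unchanged. Processing by subject may reduce the memory GLIMMIX needs compared with listing both 100+ level variables as plain random effects, but that is not guaranteed.

```sas
/* Sketch only: nested random intercepts, assuming each school   */
/* (buildingirn) belongs to exactly one stratum (strata09_n).    */
proc glimmix data=a;
   class ctytype untx ow2 strata09_n buildingirn;
   model perfindex = lowincome racialmin untx ow2 ctytype enrollment3 / solution;
   weight finalweight;
   /* one random intercept per stratum */
   random intercept / subject=strata09_n;
   /* one random intercept per school within its stratum */
   random intercept / subject=buildingirn(strata09_n);
   lsmeans ow2 untx / cl;
   where sealprog = 0; /* rerun with sealprog = 1 for the other group */
run;
```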

Thanks again,

Amber

SteveDenham · Posted 03-01-2012 09:58 AM

If you are familiar with surveyreg, I think it is the right choice. You have cluster variables, strata variables, a weight variable, and what looks like an interesting model. The data were collected in a survey, and the inference space is the universe (population) where the survey was applied--not the typical mixed model inference space of all possible populations. Consequently, the tests and confidence bounds produced by surveyreg are more likely to be correct for that particular population.

Nickel says it runs without a memory problem, as well.
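For reference, a minimal sketch of how the design variables from the posted code might map onto SURVEYREG statements. This is an assumption about the setup, not code anyone in the thread ran. The domain statement is shown as one possible way to get separate results for sealprog = 0 and sealprog = 1 while keeping the full sample design in the variance estimation, in place of subsetting with a where statement.

```sas
/* Sketch only: design-based regression using the same variables */
/* as the GLIMMIX code above.                                     */
proc surveyreg data=a;
   strata strata09_n;      /* county/income strata            */
   cluster buildingirn;    /* schools are the sampled clusters */
   weight finalweight;     /* sampling weight                  */
   class ctytype untx ow2;
   model perfindex = lowincome racialmin untx ow2 ctytype enrollment3 / solution clparm;
   domain sealprog;        /* separate analyses by program status */
run;
```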

Steve Denham