New Contributor
Posts: 4

# Appropriate analysis for stratified sample data

I would like to model the association between one binary dependent variable and one ordinal exposure variable.  The data are from a survey that used a two-stage sampling design.  In the first stage, the sample is stratified by community.  In the second stage, participants are randomly selected within these strata.

My question is: what is the benefit of using PROC SURVEYLOGISTIC over PROC GENMOD with GEE variance estimation (to account for correlation within strata)?  Or is there none?

Posts: 2,655

## Re: Appropriate analysis for stratified sample data

The most obvious difference is the inference space that each will generalize to.  Surveylogistic will generalize to repeated sampling from fixed populations defined in the sampling design.  GEE will generalize to repeated sampling from infinite populations defined on the fixed strata.  And when I read that, it sounds odd.  So, one more time.  Survey: fixed population with n samples, where n increases without bound.  GEE: all possible populations with fixed strata effects.  Subtle but real in terms of calculating variances.

The benefit of one or the other will be determined by your research objective: Characterization of an existing population via sample vs. characterization of fixed effects in an effectively infinite population (infinite realizations of a population with a fixed strata structure is better, I guess).

I just want to get away from defining anything with "infinite" in the sentence.

At this point, you probably want to punch me in the nose for not giving a straight answer.

Steve Denham

New Contributor
Posts: 4

## Re: Appropriate analysis for stratified sample data

Posted in reply to SteveDenham

Thank you Steve.  So, to restate what you said, the surveylogistic assumes that the sampling is done within a fixed population with bounds and generalizations from the research only extend to the fixed population.  Whereas, genmod+gee assumes the sampling is done in a fixed population with an infinite number of possible n's sampled.

How specifically will variance estimates be affected by each of these procedures? (Sorry this may seem like a silly question, but I'm working to improve my statistical understanding!)  Thank you.

Posts: 2,655

## Re: Appropriate analysis for stratified sample data

The best I can do is point at the documentation.  PROC SURVEYLOGISTIC has an extensive section on variance estimation in the Details section of its documentation.  GEE solves quasi-likelihood estimating equations (rather than maximizing a full likelihood) and yields a covariance matrix of the parameters, usually the empirical "sandwich" estimate, with no adjustment for the sampling design.  The estimates are asymptotic in that sense, while the survey-based models apply sampling and population corrections.  If I go any farther here, I will certainly be out of my experience range, so I hope others might step in with more concrete examples.

Steve Denham
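
To make the "population correction" idea above a little more concrete, here is a minimal Python sketch (not SAS, and not what either procedure does internally for regression coefficients). It computes the textbook variance of a stratified sample mean twice: once with the finite population correction (1 - n_h/N_h) that a design-based analysis can apply when stratum totals or sampling rates are supplied, and once without it, as if each stratum were effectively infinite. The stratum sizes, sample sizes, and sample variances are made-up numbers, and the function name `stratified_mean_variance` is hypothetical.

```python
# Hypothetical strata: (population size N_h, sample size n_h, sample variance s2_h).
# All numbers are invented purely for illustration.
strata = [
    (500, 50, 0.21),
    (2000, 100, 0.24),
    (300, 60, 0.16),
]


def stratified_mean_variance(strata, use_fpc=True):
    """Variance of a stratified mean, assuming simple random sampling within strata.

    With use_fpc=True each stratum term is multiplied by (1 - n_h / N_h),
    the finite population correction. With use_fpc=False the correction is
    dropped, which treats every stratum as effectively infinite.
    """
    total_pop = sum(N for N, _, _ in strata)
    variance = 0.0
    for N, n, s2 in strata:
        weight = N / total_pop                    # stratum share of the population
        fpc = 1.0 - n / N if use_fpc else 1.0     # finite population correction
        variance += weight ** 2 * fpc * s2 / n
    return variance


var_design = stratified_mean_variance(strata, use_fpc=True)
var_infinite = stratified_mean_variance(strata, use_fpc=False)

print(f"variance with fpc:    {var_design:.6f}")
print(f"variance without fpc: {var_infinite:.6f}")
print(f"ratio (with / without): {var_design / var_infinite:.3f}")
```

The corrected variance is never larger than the uncorrected one, and the gap grows as the sampling fraction n_h/N_h grows. When the sampling fractions are small, the two answers are nearly the same, which is one reason the choice of inference space often matters more conceptually than numerically.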

Regular Contributor
Posts: 152

## Re: Appropriate analysis for stratified sample data

Read the following reference that explains differences between design-based analyses like those in the SAS survey procedures and model-based analyses like those using multilevel mixed models (for example, PROC MIXED or PROC GLIMMIX):

Graubard BI, Korn EL.  Modeling the sampling design in the analysis of health surveys.  Statistical Methods in Medical Research 1996;5:263-281.

Sometimes the variances of parameter estimates (for example, means) from design-based analyses are smaller than those from model-based analyses, but other times the reverse is true.  For example, when the number of primary sampling units is large, the variances estimated in design-based analyses are approximately unbiased, but they may be biased when this number is small.  If the values or the variability of a parameter are associated with the number of observations in a primary sampling unit, then the variances of the parameter estimated in model-based analyses may be biased.  The authors generally prefer design-based analyses to model-based analyses because the former require fewer assumptions than the latter.

The characteristics of the survey design in design-based analyses (the strata and the primary sampling units) are considered "nuisances" and of little interest to the analyst.  However, if these characteristics of the survey design may be associated with the parameters of interest (as above), and if these characteristics are available to the analyst (which is uncommon), then modeling that accounts for these characteristics may be worthwhile (for example, in reducing the bias of model-based estimates of variance).

Although this article does not specifically compare design-based analyses with analyses based on generalized estimating equations, the same considerations probably apply.
