07-07-2016 03:04 PM
I have a complex survey sample, and want to perform logistic regression on a subsample. I retain a survey observation in my subsample if a variable D takes value 1, and reject it if the variable D takes value 0. A drawback is that I don't know what proportion of the overall population (sampling frame) has D=1. D does not define a stratum or a cluster in the design, and occurs in various proportions among the different strata.
Leslie Kish, in his classic book 'Survey Sampling' (1965) calls analysis of such a subsample "subclass analysis", and gives formulas for the estimation of a mean of a variable across such a subsample, as well as the variance of the mean. The variance, in particular, is inflated because of uncertainty around the true proportion of the population in each stratum for which D=1 holds.
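For concreteness, this is the kind of estimator I have in mind for the subclass mean. It is a sketch in my own notation, using the standard ratio-estimator and Taylor-linearization form rather than a quotation from Kish or from the SAS documentation. The symbols are assumptions: strata h, sampled clusters (PSUs) i = 1, ..., n_h within stratum h, units j, sampling weights w_{hij}, stratum sampling fractions f_h, and an indicator \delta_{hij} equal to 1 when D=1 and 0 otherwise.

```latex
% Estimated subclass size and weighted subclass mean (a ratio estimator)
\hat{N}_D = \sum_{h}\sum_{i}\sum_{j} w_{hij}\,\delta_{hij},
\qquad
\hat{\bar{Y}}_D = \frac{\sum_{h}\sum_{i}\sum_{j} w_{hij}\,\delta_{hij}\,y_{hij}}{\hat{N}_D}

% Linearized variable: zero for every unit outside the subclass
z_{hij} = \frac{\delta_{hij}\,\bigl(y_{hij} - \hat{\bar{Y}}_D\bigr)}{\hat{N}_D},
\qquad
z_{hi} = \sum_{j} w_{hij}\,z_{hij},
\qquad
\bar{z}_{h} = \frac{1}{n_h}\sum_{i=1}^{n_h} z_{hi}

% Variance estimate: sums run over ALL sampled clusters in each stratum
\hat{V}\bigl(\hat{\bar{Y}}_D\bigr)
  = \sum_{h} \frac{n_h\,(1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \bigl(z_{hi} - \bar{z}_{h}\bigr)^2
```

The point of the sketch is that clusters containing no D=1 units still enter the sums with z_{hi} = 0, and n_h counts every sampled cluster. As I understand it, that is how the uncertainty about the subclass size in each stratum feeds into the variance.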
In the SAS/STAT procedure SURVEYLOGISTIC, all that is necessary to achieve correct estimation of regression parameters in this situation is to include the statement DOMAIN D. However, I have reviewed the full documentation for SAS/STAT, as well as a range of methodological papers, and nowhere is it clear exactly what formula PROC SURVEYLOGISTIC uses to estimate the variance of the logistic regression parameter estimates when a DOMAIN statement is present. I am writing a research paper, and need to satisfy myself as to exactly what the software is doing to analyse my data.
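For reference, the call I am describing looks roughly like the sketch below. The data set and variable names (mysurvey, stratum, psu, wt, y, x1, x2) are placeholders rather than my real names, and I am assuming a stratified cluster design with a binary outcome.

```sas
/* Sketch only: data set and variable names are hypothetical placeholders */
proc surveylogistic data=mysurvey;
   strata  stratum;             /* design strata                          */
   cluster psu;                 /* primary sampling units                 */
   weight  wt;                  /* sampling weights                       */
   domain  D;                   /* request analysis within each level of D */
   model   y(event='1') = x1 x2;
run;
```

The full sample is passed to the procedure, and I read the results for the D=1 domain rather than deleting the D=0 observations beforehand.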
If anybody knows where I might find the exact mathematics behind the DOMAIN statement in PROC SURVEYLOGISTIC, I would be very grateful for a pointer.