07-23-2016 06:49 PM
I am analysing part of a weighted survey sample using the SURVEYLOGISTIC procedure, and have used the DOMAIN statement to identify records that I want to include in the analysis. However, the regression parameter estimates pertaining to the subsample I am analysing:
(a) include estimates for dummy variables that don't appear in the subsample even once (how would they be calculated!?)
(b) have variance estimates which are actually smaller than when I simply analyse the subsample using the BY statement.
Neither of these things makes sense! I would be extremely grateful for any insights.
07-24-2016 10:49 PM
I'm not very clear at all about your point (a). I think you might have to show more detail about the structure of your data set, your SURVEYLOGISTIC code and maybe some of the results. LOGISTIC and SURVEYLOGISTIC create dummy variables for each level of a categorical variable. Could that be what you're seeing?
(b) When you use BY, the variance is calculated only from the records within the BY group; with DOMAIN, the entire data set contributes to the variance of the domain estimates. The documentation on the survey procs has more detail on this.
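To make the comparison concrete, here is a minimal sketch of the two setups. Since the actual code hasn't been posted, every name in it is a placeholder: the data set (mysurvey), the design variables (stratum, psu, wt), the outcome (y), the categorical covariate (x_cat) and the domain flag (in_domain).

```sas
/* Hypothetical names throughout: mysurvey, stratum, psu, wt, y, x_cat, in_domain */

/* (1) Domain analysis: the whole sample is read, so every stratum and
       cluster is available when the variance is estimated */
proc surveylogistic data=mysurvey;
   strata stratum;
   cluster psu;
   weight wt;
   class x_cat / param=ref;
   model y(event='1') = x_cat;
   domain in_domain;
run;

/* (2) BY-group analysis: each BY group is treated as if it were a
       separate, complete sample. BY requires sorted input. */
proc sort data=mysurvey out=mysurvey_sorted;
   by in_domain;
run;

proc surveylogistic data=mysurvey_sorted;
   by in_domain;
   strata stratum;
   cluster psu;
   weight wt;
   class x_cat / param=ref;
   model y(event='1') = x_cat;
run;
```

Comparing the standard errors from (1) and (2) for the same group should show the difference described in (b).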
07-25-2016 02:31 PM
Hi. One of my regression covariates is LIMITK, a categorical variable related to having a work-limiting condition. The variable has four levels:
-9 (not applicable),
-8 (did not answer),
1 (condition affects type of work undertaken); and
2 (condition does not affect type of work undertaken).
I use the indicator variable SUBCLASS to flag members of my subclass of interest, by including the statement DOMAIN SUBCLASS.
Crucially, there are no records in SUBCLASS=1 that are coded -8 for LIMITK. Nevertheless, the SAS output provides an estimate for the coefficient of the LIMITK dummy corresponding to the value -8. My best guess is that SAS finds it more expedient to keep the estimator the same across both domains, and to model on zeros where there are no applicable values. This approach, while computationally convenient, would not affect the estimated parameter vector beta-hat. Does that make sense to you?
Another thing I'm wondering is whether there is much sense in bothering with DOMAIN analysis at all if the overall sample size runs into hundreds of thousands. Even if the domain consists of a quarter of the records, the variance in the estimate of its size will be the variance of a sample proportion, p(1-p)/n, where n is very large. If this formula is at the heart of what SAS adds to the process when the DOMAIN statement is used (and I'm assuming it is, based on my reading of Kish (1965)), it seems to suggest that it really isn't worth the bother.
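As a rough check on the orders of magnitude, the formula can be evaluated directly. This is only a sketch: n = 400000 and p = 0.25 are illustrative assumptions rather than figures from the actual survey, and it uses the simple-random-sampling formula with no weights, strata or clusters.

```sas
/* Illustrative values only: n and p are assumptions, not survey figures */
data _null_;
   n = 400000;               /* assumed total sample size          */
   p = 0.25;                 /* assumed share of records in domain */
   var_p = p * (1 - p) / n;  /* variance of a sample proportion    */
   se_p  = sqrt(var_p);      /* corresponding standard error       */
   put var_p= se_p=;         /* write both values to the log       */
run;
```

With values like these the standard error of the domain share is well under a tenth of a percentage point, which is the basis for the question above.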
07-25-2016 04:03 PM
DOMAIN analysis is often done to look at differences between subpopulations, e.g. men vs women, or buyers of Product A vs Product B. With a large sample, a difference may come out as statistically significant even when it is small in absolute terms, but the analysis is still worth doing in many cases.
The purpose of your analysis may lead you to conclude that statistical significance isn't important for your use, but for others those differences are where the interest lies.