09-04-2014 04:19 PM
I need to resample from a data set that has clusters, strata and weights. The survey design is such that, properly analyzed, results should be representative of a much larger population (the weights can be quite large). I want to resample the data to yield a representative sample of the larger population where each item has weight 1. That is, if there are 10,000 weighted entries in the original sample (in strata and clusters), each with average weight W, thus representing 10,000*W people, I want a new random resample of (say) 20,000 entries each with weight 1 and with no strata or clusters.
Can this be done in SAS?
09-04-2014 04:37 PM
I think that if you use your current weight variable as a FREQ variable, the sampling frame becomes the 10,000*W people, though you may need to round the existing weight variable to an integer (not sure). The output data set will have a weight, but you can reset it to 1, or simply leave out any weight variable in further analysis, since the default is to treat each record as having a weight of 1.
You can either list the variables to keep in an ID statement or drop the strata and clusters from later analyses. My feeling, though, would be to leave the variables in the data in case someone wants to see an analysis on at least the strata variables.
09-05-2014 09:55 AM
Yes, using FREQ allows me to get correct distributions. The problem is that NPAR1WAY doesn't do two-sample Kolmogorov-Smirnov or Wilcoxon Rank Sum tests for weighted data. I'm trying to find a way to trick it by converting a small sample of weighted data into a very large sample of unweighted data, then resampling back to its original size so that NPAR1WAY will give correct results. I know of no other way of getting SAS to calculate these statistics correctly.
So I believe I really need to resample from a weighted, complex survey sample down to a simple unweighted survey sample.
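As a rough illustration of the resampling idea described here (not SAS code; Python is used only to show the logic), the sketch below draws records with replacement, with probability proportional to their weight, so that each drawn record can then be treated as having weight 1. The function name `resample_unweighted` and its parameters are hypothetical, and the sketch simply ignores strata and clusters, as requested in the original question.

```python
import random

def resample_unweighted(records, weights, size, seed=None):
    """Draw `size` records with replacement, with selection probability
    proportional to each record's survey weight (hypothetical helper).

    Weights do not need to be integers. Strata and clusters are ignored,
    which is an assumption taken from the question, not a general rule.
    """
    rng = random.Random(seed)  # local generator so a seed gives repeatable draws
    return rng.choices(records, weights=weights, k=size)

# Usage sketch: two records, the second carrying nine times the weight
# of the first, resampled to 20,000 entries of weight 1 each.
sample = resample_unweighted(["a", "b"], [1.0, 9.0], size=20000, seed=1)
share_b = sample.count("b") / len(sample)
```

In the resample, the share of each record should track its share of the total weight, so `share_b` should come out close to 0.9.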
09-05-2014 10:31 AM
But since NPAR1WAY does allow use of a FREQ variable, I would try using your weight variable for FREQ in the procedure and not subset the data.
09-05-2014 11:55 AM
Yes, I did that, and realized NPAR1WAY was calculating things incorrectly because a frequency is not the same as a weight: the procedure accepts only a frequency, and what I have is a weight.
E.g. the Kolmogorov-Smirnov p-value is always near zero because it uses the sum of the frequencies in the calculation, but the statistical precision goes as the number of entries (i.e. as 10,000, not as 10,000*W). The same applies to the Wilcoxon Rank Sum test: with the weight used as a frequency, NPAR1WAY thinks there are 10,000*W people, and thus the p-values are always minuscule. For the K-S tests I can in principle recalculate p from D, n1 and n2. But I cannot do anything about the Wilcoxon calculations. They are wrong and I can't recalculate them.
Thus the need to create the new unweighted, resampled data set from the original weighted set.
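For the K-S part, the recalculation of p from D, n1 and n2 mentioned above could look roughly like the following. This is a minimal sketch (Python, not SAS) that assumes the standard asymptotic two-sample formula, p ≈ 2 Σ (-1)^(k-1) exp(-2 k² λ²) with λ = D·sqrt(n1·n2/(n1+n2)); the function name `ks_two_sample_pvalue` is hypothetical, and n1 and n2 are taken to be the actual numbers of entries in each group, not the sums of the weights.

```python
import math

def ks_two_sample_pvalue(d, n1, n2, max_terms=100000):
    """Asymptotic two-sample Kolmogorov-Smirnov p-value (hypothetical helper).

    d      : the K-S statistic D (maximum distance between the two EDFs)
    n1, n2 : number of entries in each group (counts, not summed weights)
    """
    lam = d * math.sqrt(n1 * n2 / (n1 + n2))
    if lam < 1e-3:
        return 1.0  # series converges very slowly here; p is effectively 1
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * k * k * lam * lam)
        total += term if k % 2 == 1 else -term  # alternating series
        if term < 1e-16:
            break  # remaining terms are negligible
    return min(1.0, max(0.0, 2.0 * total))

# Same D, different sample sizes: actual counts vs. counts inflated by weights.
p_actual = ks_two_sample_pvalue(0.136, 200, 200)
p_inflated = ks_two_sample_pvalue(0.136, 200000, 200000)
```

With the same D, `p_actual` is roughly 0.05 while `p_inflated` is essentially zero, which is the effect described above when the sum of the frequencies is used in place of the number of entries.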