Hi Steve, apologies for the long gap between your insightful reply and this response. I don't think you're in error at all on this; I think you're right on the money. However, even if we've identified WHAT is happening, I am still unclear on WHY.

First, to cover WHAT is happening. After a LOT of thought I have arrived at the following understanding. I think we are both referring to the equation in the documentation found here: SAS Help Center: Proportions (notation explained here: SAS Help Center: Definitions and Notation), which appears to be exactly the Taylor series approximation for calculating standard errors. I confess I certainly couldn't derive this equation, but I suppose it rests on an approximation of how the central estimate varies as different 'elements' are left out (I've retrofitted this explanation, to be honest, to make the rest make sense). The equation then calculates the variance of this approximation. This is different from the variance in the equation StdErr = sqrt(Variance / n) that we learn in basic statistics: in that simple equation the variance is the variance of all the individual measurements, whereas in the Taylor expression it is the variance of mean values (similar to a bootstrap method).

Looking at the Taylor expression, it would appear that the individual data are not important to the calculation, so long as the mean of each cluster remains unchanged. Where does this approximation come from? Why would it behave like this with respect to individual data? It must be that the 'elements' left out in the construction of the approximation are entire clusters (!), as opposed to individual data points, an idea that is reflected in your insight about degrees of freedom.

However, I want to return to the question of WHY. Why is the approximation constructed to behave this way? Surely this is a flaw in the calculation? Is it not intuitive that increasing the sample size should decrease the uncertainty in the estimate? If I have left the clustering unchanged, there is one component of uncertainty that should indeed remain unchanged: the uncertainty from random cluster selection. But another component of uncertainty, the uncertainty within clusters, has been reduced and should surely be reflected somehow in the result. To use the end of my previous reply as a refrain: otherwise a survey that samples 2 units per cluster would be apparently indistinguishable, in terms of data quality, from one that samples 10,000 units per cluster. Surely that shouldn't be so? (I've tried to make this concrete in the P.S. below.)

Many thanks in advance to any that venture to reply to this :).
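P.S. For reference, here is what I believe the documented equation reduces to in the simplest possible case: one stratum, all weights equal to 1, and no finite-population correction. This is my own stripped-down reading of the SAS Help Center formula, so treat it with due suspicion:

```latex
% Taylor series variance of an estimated proportion \hat{p}, simplified to
% one stratum, equal weights, and no finite-population correction.
\hat{V}(\hat{p}) \;=\; \frac{n}{n-1}\sum_{i=1}^{n}\bigl(e_i - \bar{e}\bigr)^2,
\qquad
e_i \;=\; \frac{1}{W}\sum_{j=1}^{m_i}\bigl(\delta_{ij} - \hat{p}\bigr)
      \;=\; \frac{m_i}{W}\bigl(\bar{p}_i - \hat{p}\bigr)
```

Here n is the number of sampled clusters, m_i the number of units sampled in cluster i, W = m_1 + ... + m_n, δ_ij the 0/1 indicator for unit j of cluster i, and p̄_i the sample proportion within cluster i. Notice that e_i touches the data only through each cluster's size and mean; with equal cluster sizes the m_i cancel entirely and the whole thing collapses to (n/(n-1)) times the sum of (p̄_i - p̂)²/n², so nothing about the within-cluster sample size survives into the variance.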
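And to make the refrain concrete, here is a quick numerical sketch of that simplified formula (not SAS itself, just my own Python rendering of it; the helper name taylor_se_proportion is mine). It builds two hypothetical single-stage cluster samples with identical cluster means, one with 10 units per cluster and one with 10,000, and computes the Taylor-linearized standard error for each:

```python
import numpy as np

def taylor_se_proportion(delta, cluster):
    """Taylor-linearized SE of a proportion under single-stage cluster
    sampling: one stratum, equal weights, no finite-population correction
    (my stripped-down reading of the SAS Help Center formula)."""
    delta = np.asarray(delta, dtype=float)
    cluster = np.asarray(cluster)
    W = delta.size                      # total weight (all weights are 1)
    p_hat = delta.mean()                # overall estimated proportion
    labels = np.unique(cluster)
    n = labels.size                     # number of sampled clusters
    # Linearized residual total per cluster: e_i = sum_j (delta_ij - p_hat) / W.
    # It depends on the data only through each cluster's size and mean.
    e = np.array([(delta[cluster == c] - p_hat).sum() / W for c in labels])
    var = n / (n - 1) * ((e - e.mean()) ** 2).sum()
    return np.sqrt(var)

# Two designs with the same cluster means (0.2, 0.4, 0.6, 0.8)
# but wildly different within-cluster sample sizes.
cluster_means = [0.2, 0.4, 0.6, 0.8]
for m in (10, 10_000):
    delta = np.concatenate(
        [np.r_[np.ones(round(m * p)), np.zeros(m - round(m * p))]
         for p in cluster_means])
    cluster = np.repeat(np.arange(len(cluster_means)), m)
    print(f"{m:>6} units per cluster -> SE = {taylor_se_proportion(delta, cluster):.4f}")
```

If I've understood the formula correctly, both lines print the same standard error (about 0.129), which is exactly the behaviour I'm puzzled by: 9,990 extra units per cluster buy no apparent reduction in the reported uncertainty.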