AlexB-B
Calcite | Level 5

I am experimenting with standard errors and the effect that changes in sample size have on them in complex survey designs. I created a dummy data set using the following code:

 

%let SampleSize = 20000;

data testbed;
	call streaminit(123);	/* fixed seed for reproducibility */

	Cluster = 0;
	do Key = 1 to &SampleSize;
		/* draw an integer Value in 1..10 from a truncated Normal(5, 3) */
		Value = 0;
		do while (Value < 1 or Value > 10);
			Value = floor(rand("Normal", 5, 3));
		end;

		/* start a new cluster every 40 records: 20,000 / 40 = 500 clusters */
		if mod(Key-1, 40) = 0 then Cluster = Cluster + 1;

		output;
	end;
run;

If I run the following code:

title "500 clusters";
proc surveyfreq data=testbed;
	cluster Cluster;
	tables value;
run;

I get the following result:

[Image: PROC SURVEYFREQ output for Value with frequencies, percents, and standard errors (500 clusters)]

I then create a dataset whose only difference is that it contains twice as many records, i.e. each record is duplicated (the primary key is also adjusted):

proc sql;
	/* stack two copies of testbed; CloneNum distinguishes the copies */
	create table testbeddouble as
		select *, 1 as CloneNum from testbed
			union all
		select *, 2 as CloneNum from testbed;
	/* shift the primary key of the second copy so Key stays unique */
	update testbeddouble
		set Key = Key + &SampleSize
		where CloneNum = 2;
quit;

And run the following code:

title "500 clusters and doubled data";
proc surveyfreq data=testbeddouble;
	cluster Cluster;
	tables value;
run;

I get the following result:

[Image: PROC SURVEYFREQ output for Value with frequencies, percents, and standard errors (500 clusters, doubled data)]

None of the standard error estimates have changed, despite a large increase in sample size. I do not see this effect if I don't use clustering, only if I do. Changing the number of clusters does have an effect on standard errors, but for a given number of clusters, changes to the sample size have no effect.
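
To make the comparison concrete, this is what I mean by not using clustering (a minimal sketch: the same TABLES request with no CLUSTER statement, where doubling the data shrinks the standard errors by roughly a factor of sqrt(2)):

title "No clusters";
proc surveyfreq data=testbed;
	tables value;
run;

title "No clusters, doubled data";
proc surveyfreq data=testbeddouble;
	tables value;
run;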

 

Can anybody explain to me what's going on? Am I wrong to expect the standard errors to change? Why not?

4 REPLIES
SteveDenham
Jade | Level 19

This is a guess only, but if both samples have the same number of clusters (and the data is simply duplicated), I would expect the means (proportions) and standard errors to be identical.

 

SteveDenham

AlexB-B
Calcite | Level 5

Hi Steve, thanks for your response. You are effectively saying that I should not expect the standard errors to change in this scenario. But why? For a simple survey design, the standard error, as opposed to the standard deviation, is inversely proportional to the square root of the sample size. Clustering should complicate this relationship, but I would still expect sample size to have some noticeable effect on standard errors. Otherwise a survey that samples 2 units per cluster would be apparently indistinguishable, in terms of data quality, from one that samples 10,000 units per cluster.
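
Concretely, the relationship I have in mind is just the textbook simple-random-sampling result (nothing specific to SURVEYFREQ):

\[ \mathrm{SE}(\bar{y}) = \frac{s}{\sqrt{n}} \]

so doubling n should multiply the standard error by 1/sqrt(2), roughly 0.71.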

SteveDenham
Jade | Level 19

Well, I went digging through the documentation and came across two things that might bear on this. The first is that the standard error of the mean is the square root of the variance. Note that this is different to the square root of the variance divided by n, which is what I think you were expecting. The second is how the variance is estimated. The default is a Taylor series expansion, and the formulas do involve various values of n by cluster. However, later on, when the documentation discusses degrees of freedom, it says:

Taylor Series Variance Estimation

For the Taylor series method, PROC SURVEYMEANS calculates the degrees of freedom for the t test as the number of clusters minus the number of strata. If there are no clusters, then the degrees of freedom equals the number of observations minus the number of strata. If the design is not stratified, then the degrees of freedom equals the number of PSUs minus one.

 

To me, that says that the divisor when calculating the variance is the number of clusters minus the number of strata; if the design is not stratified, it is the number of primary sampling units minus one. There is nothing about the sample size within clusters entering into this calculation. So, to match the t test calculation up with the variance calculation, I would expect the same logic to apply. I am quite willing to believe I am in error on this, but based on your example, that is what appears to be happening.
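
One way to check this reading (a sketch on my part, not anything from the documentation; it assumes equal cluster sizes, no weights, and no finite population correction, in which case the Taylor standard error of a mean should reduce to the standard deviation of the cluster means divided by the square root of the number of clusters):

/* per-cluster means of Value (the _type_=1 rows are the CLASS-level rows) */
proc means data=testbed noprint;
	class Cluster;
	var Value;
	output out=clustermeans(where=(_type_=1)) mean=ClusterMean;
run;

/* hand calculation: std of cluster means over sqrt(number of clusters) */
proc sql;
	select std(ClusterMean) / sqrt(count(*)) as HandCalcSE
		from clustermeans;
quit;

/* compare against the Taylor series standard error from SURVEYMEANS */
proc surveymeans data=testbed mean stderr;
	cluster Cluster;
	var Value;
run;

If this reading is right, then duplicating every record leaves every ClusterMean unchanged, so the hand calculation, and therefore the standard error, cannot move.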

 

SteveDenham

 

AlexB-B
Calcite | Level 5

Hi Steve, apologies for the great length of time between your insightful reply and this response.

 

I don't think you're in error at all on this, I think you're right on the money. However, even if we've identified WHAT is happening, I am still unclear on WHY.

 

First, to cover WHAT is happening. After a LOT of thought I have arrived at the following understanding:

I think we are both referring to the equation in the documentation here (SAS Help Center: Proportions; notation explained at SAS Help Center: Definitions and Notation), which appears to be exactly the Taylor series approximation used to calculate the standard errors. I confess I certainly couldn't derive this equation, but I suppose it approximates how the central estimate varies as different 'elements' are left out (I've retrofitted this explanation, to be honest, to try to make the rest make sense), and then calculates the variance of that approximation.

This is different to the variance in the StdErr = sqrt(Variance / n) formula we learn in basic statistics: in the simple formula the variance is the variance of the individual measurements, whereas in the Taylor expression it is the variance of mean values (similar in spirit to a bootstrap). Looking at the Taylor expression, individual data points appear not to matter to the calculation at all, so long as the mean of each cluster remains unchanged.

Where does this approximation come from, and why would it behave like this with respect to individual data? It must be that the 'elements' left out in constructing the approximation are entire clusters (!), rather than individual data points, an idea that is reflected in your insight about the degrees of freedom.
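
For reference, here is my reading of that equation, simplified to an unstratified, unweighted design with no finite population correction (my own transcription, so treat it as a sketch rather than the exact formula):

\[ \widehat{V}(\hat{p}) = \frac{n}{n-1} \sum_{i=1}^{n} (e_i - \bar{e})^2, \qquad e_i = \frac{1}{N} \sum_{j \in \text{cluster } i} (\delta_j - \hat{p}), \]

where n is the number of clusters, N is the total number of observations, and \delta_j indicates whether observation j falls in the table level in question. Only the per-cluster totals e_i enter the variance. Duplicating every record doubles both each cluster's sum of residuals and N, so each e_i, and hence the whole estimate, is unchanged.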

 

However, I want to return to the question of WHY. Why is the approximation constructed to behave this way? It would surely appear to be a flaw in the calculation. Is it not intuitive that increasing the sample size should decrease uncertainty in the estimate? If I have left the clustering unchanged, there is one component of uncertainty that should indeed remain unchanged: the uncertainty from random cluster selection. But another component, the uncertainty within clusters, has reduced, and should surely be reflected somehow in the result. To use the end of my previous reply as a refrain: otherwise a survey that samples 2 units per cluster would be apparently indistinguishable, in terms of data quality, from one that samples 10,000 units per cluster. Surely that shouldn't be so?
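
The picture I have in mind is the textbook two-stage sampling decomposition (my own sketch, not something from the SURVEYFREQ documentation):

\[ V(\bar{y}) \approx \frac{S_b^2}{n} + \frac{S_w^2}{n \bar{m}}, \]

where S_b^2 is the between-cluster variance, S_w^2 the within-cluster variance, n the number of clusters, and \bar{m} the number of units sampled per cluster. The second term shrinks as \bar{m} grows, which is why I expect within-cluster sample size to have at least some effect, even if the first term dominates.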

Many thanks in advance to any that venture to reply to this :).
