Re: Subgroup Variance Estimation in complex surveys with 9.3

EmoryPulm · Posted 02-19-2013 11:38 AM

I am analyzing data from a subgroup of a complex, multistage survey: NHANES. In the analytic documentation on the NHANES site, they describe that SAS 9.1 and 9.2 do not correctly calculate variance since they do not correctly calculate degrees of freedom. It goes on to explain that these versions of SAS do not account for strata and PSUs with missing data. My question is whether SAS 9.3 has fixed this miscalculation or not.

I have included the NHANES analytic notes below if they are helpful. Thanks!!

Key Concepts About Degrees of Freedom for Performing Statistical Tests and Calculating Confidence Limits

Degrees of Freedom and NHANES Subgroups

Estimates are often calculated for various subgroups of interest within the total NHANES population. When the number of first stage sampling units (PSUs) is small, the z-statistic should be replaced by a value from a t-distribution when computing confidence limits for these estimates (see SUDAAN 1995 — ref from NHANES III analytic guidelines).

To calculate the correct value for the t-statistic from a t-distribution and a selected level of significance, you must calculate the proper degrees of freedom for the estimate.

In addition, it is important to examine the number of degrees of freedom on which a standard error estimate is based. Continuing research on issues related to the stability of variance estimates in subdomains of NHANES has been published and shows that standard error estimates based on small numbers of paired PSUs (i.e., degrees of freedom) are prone to instability.

The reliability of the estimated standard error, as measured by its relative standard error (i.e., (standard error of the standard error of the estimate/standard error of the estimate)*100), is inversely proportional to its degrees of freedom. As the number of degrees of freedom increases, the relative standard error decreases and the reliability of the estimate increases. The NHANES guidelines recommended a relative standard error of at most 30%. This corresponds to at least 12 degrees of freedom.

Degrees of freedom are properly calculated by subtracting the number of clusters in the first level of sampling (strata) from the number of clusters in the second level of sampling (PSUs) for each subgroup you are analyzing, as shown in the equation below.

Equation for Degrees of Freedom

deg of freedom = # of PSUs - # of strata
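The subgroup version of this count can be sketched in a few lines of Python. This is only an illustration of the formula above, not code from NHANES, SAS, or SUDAAN: the function name `subgroup_df` and the toy stratum/PSU identifiers are made up, and each record is assumed to be a (stratum, PSU) pair for one subgroup member with valid data.

```python
def subgroup_df(records):
    """Degrees of freedom = # of PSUs - # of strata, counting only the
    strata and PSUs that contain at least one record.

    records: iterable of (stratum_id, psu_id) pairs, one per observation.
    PSU ids are assumed to be unique only within a stratum, so a PSU is
    identified by the (stratum_id, psu_id) pair.
    """
    records = list(records)
    strata = {stratum for stratum, _ in records}
    psus = {(stratum, psu) for stratum, psu in records}
    return len(psus) - len(strata)


# Hypothetical design: 3 strata with 2 PSUs each.
full_sample = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]

# Hypothetical subgroup: present in both PSUs of stratum 1,
# one PSU of stratum 2, and absent from stratum 3.
subgroup = [(1, 1), (1, 1), (1, 2), (2, 1)]

print(subgroup_df(full_sample))  # 6 PSUs - 3 strata = 3
print(subgroup_df(subgroup))     # 3 PSUs - 2 strata = 1
```

In this toy case the subgroup has fewer degrees of freedom than the full design, which is why the count has to be redone for each subgroup.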

Differences in Degrees of Freedom for Subgroups in SUDAAN and SAS Survey Procedures

For both SUDAAN and SAS Survey procedures, the degrees of freedom are calculated in the same way when looking at the entire sample population or in subgroups where all strata and PSUs are represented.

However, when you analyze data on a subgroup of sample persons who may not be represented in all strata and PSUs (e.g., Mexican Americans), the degrees of freedom provided in the output may differ. For example, SUDAAN will correctly count the number of PSUs and strata with at least one valid observation for each cell of the table being requested. In contrast, SAS 9.1 Survey procedures, such as proc surveymeans, compute the degrees of freedom as the number of clusters (PSUs) in the non-empty strata minus the number of non-empty strata. This means that if your data have empty strata (no persons in the population for either PSU) the number of degrees of freedom will increase. This is incorrect and SAS is currently working on correcting this problem. For more information on methods of correctly calculating degrees of freedom using SAS 9.1 Survey procedures, please see the following two SAS 9.1 Survey procedures macros.

1zmm · Posted 02-21-2013 08:24 AM

Since you did not provide example output comparing SAS's PROC SURVEYxxx procedures with SUDAAN's comparable procedures, I can suggest only that you read the corresponding SAS version 9.3 documentation on how its PROC SURVEYxxx procedures calculate degrees of freedom (for example, PROC SURVEYMEANS, pages 7430-7431):

http://support.sas.com/documentation/cdl/en/statug/63962/pdf/default/statug.pdf

If the data have empty strata, won't the PSUs within those strata also be empty, and so also be left out of the calculation of degrees of freedom? Thus, the number of degrees of freedom will not necessarily increase: it decreases whenever the number of empty PSUs exceeds the number of empty strata that contain them. For example, the formula for degrees of freedom can equal either

1) DF = # PSUs - #strata, or

2) DF = # non-empty PSUs - # non-empty strata.

If there were 50 PSUs and 5 strata, DF according to formula #1 would equal 50 - 5 = 45 df.

If each stratum has 10 PSUs, and if one of these strata were empty, then the number of non-empty strata would equal 4, and the number of non-empty PSUs would equal 40, so that the DF according to formula #2 would equal 40 - 4 = 36 df. Using formula #2 thus leads to an estimate of fewer DF than using formula #1, so that formula #2 is more statistically conservative.
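The arithmetic in this example can be checked with a small Python sketch. The function names and the list-of-counts representation are my own, purely for illustration; neither corresponds to anything in SAS or SUDAAN.

```python
def df_all(psus_per_stratum):
    """Formula #1: DF = # PSUs - # strata, counting every stratum in the design."""
    return sum(psus_per_stratum) - len(psus_per_stratum)


def df_nonempty(nonempty_psus_per_stratum):
    """Formula #2: DF = # non-empty PSUs - # non-empty strata.

    Each entry is the number of PSUs in that stratum that contain at least
    one observation; a stratum with 0 such PSUs is treated as empty.
    """
    nonempty = [n for n in nonempty_psus_per_stratum if n > 0]
    return sum(nonempty) - len(nonempty)


design = [10, 10, 10, 10, 10]    # 5 strata, 10 PSUs each
observed = [10, 10, 10, 10, 0]   # one stratum entirely empty

print(df_all(design))         # 50 - 5 = 45
print(df_nonempty(observed))  # 40 - 4 = 36
```

Under these assumptions, dropping an empty stratum removes 10 PSUs but only 1 stratum from the count, so formula #2 gives the smaller value.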

ballardw · Posted 02-21-2013 11:38 AM

It has been a few years since I worked with SUDAAN on a regular basis, but when we had strata with single observations, a log note said something about borrowing variance from the following stratum. Considering that in our data at the time the following stratum was an entirely separate geographic region, I always wondered about that.

1zmm · Posted 02-21-2013 01:48 PM

When SUDAAN "discovers" a stratum with only one PSU in the data, it calculates a "variance" of the mean for that stratum as the square of the difference between the observed PSU mean and the mean in all the data (the "grand" mean). Obviously, a single value like the observed PSU mean has a variance of zero, but SUDAAN uses this other variance estimate for the stratum instead. I think that SUDAAN uses a similar type of estimator when a PSU has only one observation to calculate the variance of the PSU.
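As a rough illustration of the single-PSU calculation described in this post, here is a minimal Python sketch. It only mirrors the description above (squared difference between the lone PSU's mean and the grand mean); it is not SUDAAN's actual code, it ignores sampling weights, and the function name and data are hypothetical.

```python
from statistics import mean


def single_psu_stratum_variance(psu_values, all_values):
    """Stand-in "variance" for a stratum with a single PSU: the squared
    difference between that PSU's mean and the grand mean of all the data.
    Unweighted, for illustration only.
    """
    return (mean(psu_values) - mean(all_values)) ** 2


# Hypothetical data: the lone PSU has mean 5.0, the full data set has mean 4.0.
lone_psu = [4.0, 6.0]
everything = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]

print(single_psu_stratum_variance(lone_psu, everything))  # (5.0 - 4.0)**2 = 1.0
```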

DWilson · Posted 03-27-2019 03:57 PM

SUDAAN uses # PSUs - # strata to calculate degrees of freedom for overall and sub-population estimates.

SUDAAN does not currently have a means of calculating the degrees of freedom using only those PSUs with at least 1 member of the sub-population. You can specify your own degrees of freedom in SUDAAN, however.

SAS 9.4 survey procedures have a switch/option to calculate # PSUs - #strata using only those PSUs with at least 1 member of the sub-population.

STATA uses # PSUs - #strata using only those PSUs with at least 1 member of the sub-population by default.

The issue with degrees of freedom and replicate-based variance estimation is more complex.

SAS and STATA use the number of replicates (or # of replicates - 1) as the default degrees of freedom.

The R Survey package assumes an infinite number of degrees of freedom.

Note that the default degrees of freedom for replicate-based variance estimation can be much higher than the corresponding degrees of freedom if using #PSUs - # strata among those PSUs with at least 1 member of the subpopulation.

You can specify your own degrees of freedom when using replicate-based variance estimation in all three software programs: SAS, STATA, and the R Survey package.