Poststratification with PROC SURVEYMEANS

Overview

This example uses PROC SURVEYMEANS to obtain poststratified totals, means, and ratios. The data are sampled from county-level data sets that are publicly available from the USDA Economic Research Service website, at http://www.ers.usda.gov/data-products/county-level-data-sets.aspx. The sample consists of the county-level information about population size, the number of individuals in the labor force, and the number of unemployed persons in the 48 contiguous states of the United States of America in 2011. The sampling frame is stratified by state, and a simple random sample of two counties per state is selected. The analysis consists of a comparison between the non-poststratified estimates and the poststratified estimates of the total and average labor force size, number of unemployed, population size, and two ratios: the unemployment rate and the labor force participation rate. Table 1 describes the contents of the sample data set Unemployment, and Table 2 describes the interpretation of the six levels of the National Center for Health Statistics (NCHS) urban-rural classification for each county.

Table 1: Example Data Set Unemployment

Variable	Description
FIPS	Federal information processing standards (FIPS) code for counties
ST_FIPS	FIPS code for states
State	Abbreviation of state name
County	County name
Code2006	National Center for Health Statistics (NCHS) 2006 urban-rural classification code
Population	Resident total population estimate as of July 1, 2011
LaborForce	Number of individuals in the civilian labor force in 2011
Unemployed	Number of unemployed individuals in 2011
SamplingWeight	Sampling weight generated by the SURVEYSELECT procedure

Table 2: 2006 NCHS Urban-Rural Classification Scheme

Code	Urbanization Level	Classification Rules
1	Large metro, central	Counties in metropolitan statistical area (MSA) with population of 1 million
		or more that have the following characteristics:
		1) contain the entire population of the largest principal city of the MSA, or
		2) are completely contained within the largest principal city of the MSA, or
		3) contain at least 250,000 residents of any principal city in the MSA
2	Large metro, fringe	Counties in MSA with 1 million or more population that do not qualify as large central
3	Medium metro	Counties in MSA with 250,000–999,999 population
4	Small metro	Counties in MSA with 50,000–249,999 population
5	Micropolitan	Counties in micropolitan statistical area
6	Noncore	Counties not in micropolitan statistical area

The following SAS statements create the SAS data set Unemployment:

data unemployment;
  input FIPS 1-5 ST_FIPS 7-8 State $ 10-11 County $ 13-34 Code2006 35
        Population 37-45 LaborForce 46-52 Unemployed 53-58
        SamplingWeight 59-64;
  datalines;
1005  1  AL Barbour County        5 27313    9761   1110  33.5
1019  1  AL Cherokee County       6 26094    11696  1020  33.5
4021  4  AZ Pinal County          2 383553   139864 14466 7.5
4027  4  AZ Yuma County           4 200374   89500  24270 7.5
5105  5  AR Perry County          3 10384    4788   414   37.5

   ... more lines ...

55119 55 WI Taylor County         6 20759    10406  915   36.0
56025 56 WY Natrona County        4 76356    42907  2537  11.5
56037 56 WY Sweetwater County     5 44078    25138  1271  11.5
;
run;
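The sample itself is not drawn in this example, but a design like the one described in the overview (a simple random sample of two counties within each state) could be selected by PROC SURVEYSELECT along the following lines. This is only a sketch: the frame data set CountyFrame, the output data set CountySample, and the seed value are hypothetical names and values that are not part of the original example.

```sas
/* Hypothetical frame: one row per county, with ST_FIPS identifying the state */
proc sort data=CountyFrame;
  by ST_FIPS;
run;

/* Select a simple random sample of two counties within each state */
proc surveyselect data=CountyFrame method=srs n=2
                  seed=12345 out=CountySample;
  strata ST_FIPS;
run;
```

Under this design, each sampling weight is the number of counties in the state divided by 2. For example, the weight of 33.5 for the Alabama counties in the Unemployment data set is consistent with selecting 2 of 67 counties.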

You begin the comparative analysis by using PROC SURVEYMEANS as in the following statements to estimate the means, totals, and ratios of interest. The MEAN and SUM keywords in the PROC SURVEYMEANS statement request estimates of the population means and totals, respectively. The VAR statement requests estimates of the variables LaborForce, Unemployed, and Population. So, for example, if you specify the keyword MEAN in the PROC SURVEYMEANS statement and the variable Unemployed in the VAR statement, you are requesting an estimate of how many unemployed persons, on average, reside in a county. The first RATIO statement requests an estimate of the population’s unemployment rate, which is the ratio of the number of unemployed to the size of the labor force. The second RATIO statement requests an estimate of the labor force participation rate, which is the ratio of the size of the labor force to the size of the population of the county. The STRATA and WEIGHT statements identify the sampling design: the STRATA statement specifies that the strata are identified by the variable ST_FIPS, and the WEIGHT statement specifies that the sampling weights are contained in the variable SamplingWeight.

proc surveymeans data=unemployment mean sum;
  strata st_fips;
  weight SamplingWeight;
  var LaborForce Unemployed Population;
  ratio 'Unemployment Rate' Unemployed / LaborForce;
  ratio 'Labor Force Participation Rate' LaborForce / Population;
run;

Output 1 displays the estimated means, totals, ratios, and their standard errors. For example, on average there are 110,064 individuals in a county, of whom 53,472 are in the labor force and 4,925 are unemployed. The estimated unemployment rate is 9.2%, and the estimated labor force participation rate is 48.58%.

Output 1: Stratified Design

The SURVEYMEANS Procedure

Data Summary
Number of Strata	48
Number of Observations	96
Sum of Weights	3108

Statistics
Variable	Mean	Std Error of Mean	Sum	Std Dev
LaborForce	53472	6488.570784	166190527	20166478
Unemployed	4924.943050	594.657745	15306723	1848196
Population	110064	13105	342078597	40729501

Ratio Analysis: Unemployment Rate
Numerator	Denominator	Ratio	Std Err
Unemployed	LaborForce	0.092103	0.003090

Ratio Analysis: Labor Force Participation Rate
Numerator	Denominator	Ratio	Std Err
LaborForce	Population	0.485826	0.004186

In addition to the sample, the NCHS urban-rural classification code (Ingram and Franco, 2012) for each county in the sample and the total number of counties in the population that have each of the six levels of the NCHS classification are known. If the totals, means, and ratios of the variables of interest are homogeneous for counties that have the same NCHS urban-rural classification, but there is significant heterogeneity between counties whose classifications differ, then poststratifying by the NCHS urban-rural classification can potentially yield more efficient estimates.

The following SAS statements create the poststratum totals data set Poststrata. This data set is to be used in the PSTOTAL= option of the SURVEYMEANS procedure’s POSTSTRATA statement. A poststratum total data set must contain all the poststratification variables that are listed in the POSTSTRATA statement, and it must have a variable named _PSTOTAL_ that contains the poststratum totals. In the Poststrata data set, the variable Code2006 contains the poststratum identification code, and the variable _PSTOTAL_ contains the total number of counties in that poststratum in 2011.

data poststrata;
  input Code2006 _PSTOTAL_ ;
  datalines;
1 62
2 354
3 329
4 340
5 688
6 1336
;
run;

Figure 1 compares the distributions of Code2006 in the population and the weighted sample. Based on the weighted sample, counties that have values of 3 and 4 are overrepresented in the sample, and counties that have values of 5 and 6 are underrepresented in the sample. Poststratifying on Code2006 reweights the data such that the poststratified weighted sample distribution of Code2006 equals the population distribution.

Figure 1: Population Distribution versus Weighted Sample Distribution of Code2006
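A comparison like the one in Figure 1 can also be tabulated directly. The following statements are a minimal sketch, not necessarily how the figure was produced: the first PROC FREQ step uses the sampling weights to obtain the weighted sample distribution of Code2006, and the second step uses the _PSTOTAL_ values in the Poststrata data set to obtain the population distribution.

```sas
/* Weighted sample distribution of the NCHS classification */
proc freq data=unemployment;
  tables Code2006;
  weight SamplingWeight;
run;

/* Population distribution implied by the poststratum totals */
proc freq data=poststrata;
  tables Code2006;
  weight _PSTOTAL_;
run;
```

Comparing the Percent columns of the two tables shows which classification levels are overrepresented or underrepresented in the weighted sample.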

To perform a poststratified analysis, you add a POSTSTRATA statement to the PROC SURVEYMEANS step, as in the following statements. The POSTSTRATA statement designates Code2006 as the poststratification variable, and the PSTOTAL= option names the SAS data set Poststrata as the source of the poststratum totals. The OUT= option saves the poststratification weights to the SAS data set Pswgt.

proc surveymeans data=unemployment mean sum;
  strata st_fips;
  weight SamplingWeight;
  var LaborForce Unemployed Population;
  ratio 'Unemployment Rate' Unemployed / LaborForce;
  ratio 'Labor Force Participation Rate' LaborForce / Population;
  poststrata code2006 / pstotal=poststrata out=pswgt;
run;

Figure 2 shows the ratios of the poststratification weights to the original sampling weights for each category of Code2006. Poststratification reduces the weights for counties that have Code2006 values of 3 and 4 and increases the weights for counties that have Code2006 values of 5 and 6.

Figure 2: Ratio of Poststratification Weights to Sampling Weights
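The ratios in Figure 2 can be approximated by hand. In the usual form of poststratification, each sampling weight is multiplied by the known poststratum total divided by the sum of the sampling weights in that poststratum. The following statements are a sketch under that assumption; the data set names Wsum and Psfactor and the variables SumWgt and Factor are introduced here for illustration only.

```sas
/* Sum the sampling weights within each poststratum */
proc means data=unemployment noprint nway;
  class Code2006;
  var SamplingWeight;
  output out=wsum(keep=Code2006 SumWgt) sum=SumWgt;
run;

/* Adjustment factor = known poststratum total / weighted sample total */
data psfactor;
  merge wsum poststrata;
  by Code2006;
  Factor = _PSTOTAL_ / SumWgt;
run;

proc print data=psfactor noobs;
run;
```

A factor below 1 indicates a poststratum whose counties are overrepresented in the weighted sample, and a factor above 1 indicates a poststratum whose counties are underrepresented.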

Figure 3 shows that, as expected, the poststratified weighted sample has the same distribution as the population.

Figure 3: Population Distribution versus Poststratified Weighted Sample Distribution of Code2006

Output 2 displays the poststratified estimates and their standard errors. All the poststratified estimates of the population means and totals are smaller than the non-poststratified estimates, but the two poststratified ratio estimates are larger. For example, the poststratified estimates indicate that on average there are 100,215 individuals in a county, of whom 48,755 are in the labor force and 4,518 are unemployed. The estimated unemployment rate is 9.3%, and the estimated labor force participation rate is 48.65%. Without exception, the standard errors of the estimates are smaller for the poststratified analysis, indicating that the poststratified estimates are more efficient for this sample.

Output 2: Poststratified Analysis

The SURVEYMEANS Procedure

Data Summary
Number of Strata	48
Number of Poststrata	6
Number of Observations	96
Sum of Weights	3108

Statistics
Variable	Mean	Std Error of Mean	Sum	Std Dev
LaborForce	48755	4808.671480	151579056	14950160
Unemployed	4517.976061	477.440072	14046388	1484361
Population	100215	9964.992605	311568502	30981162

Ratio Analysis: Unemployment Rate
Numerator	Denominator	Ratio	Std Err
Unemployed	LaborForce	0.092667	0.002727

Ratio Analysis: Labor Force Participation Rate
Numerator	Denominator	Ratio	Std Err
LaborForce	Population	0.486503	0.003853
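One way to summarize the efficiency gain is the ratio of the estimated variances, which is the squared ratio of the standard errors. The following DATA step is a small sketch that copies the standard errors of the means from Output 1 and Output 2; the data set name Efficiency and the variable names SE_Strat, SE_Post, and VarRatio are chosen here for illustration.

```sas
/* Standard errors of the means from Output 1 (stratified design only)
   and from Output 2 (poststratified analysis) */
data efficiency;
  input Variable :$10. SE_Strat SE_Post;
  /* Ratio of estimated variances; values below 1 favor poststratification */
  VarRatio = (SE_Post / SE_Strat)**2;
  datalines;
LaborForce 6488.570784 4808.671480
Unemployed 594.657745 477.440072
Population 13105 9964.992605
;

proc print data=efficiency noobs;
run;
```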

Example: Age-Adjusted Mortality Rates

Suppose you want to compare the mortality rates of Florida and California. If you have samples from the two populations, computing the crude mortality rate for each population is straightforward. However, because many health outcomes vary by age and the two populations have different age distributions, a direct comparison of the crude mortality rates might be inappropriate. To make a relative comparison, you can use age-adjusted mortality rates. A common method of computing age-adjusted rates is called direct standardization; it is mathematically equivalent to poststratification.

The following SAS statements create the data sets Florida and California, which contain samples from a one-stage clustered sampling design that has a sampling rate of 0.5; the clusters consist of counties from the respective states, and the observations are age-specific groups. Each observation records the variable FIPS, which identifies the clusters (counties); the categorical variable Age, which identifies the age group; the variable Population, which records the total number of individuals in an age-specific group in 1968; the variable Deaths, which records the total number of recorded deaths in an age-specific group in 1968; and the variable SamplingWeight, which is the inverse of the probability of selecting a county in the sample. The data are sampled from the Compressed Mortality File (CMF), which is publicly available from the Centers for Disease Control and Prevention website, at http://www.cdc.gov/nchs/data_access/cmf.htm#data_availability.

data Florida;
   input FIPS Age Population Deaths;
   SamplingWeight=1.9705882353;
   datalines;
12011 4 7730 177
12011 5 32956 44
12011 6 49587 22
12011 7 49407 23
12011 8 40175 46
12011 9 29425 52

   ... more lines ...

12133 11 1048 5
12133 12 1149 13
12133 13 1252 20
12133 14 896 33
12133 15 425 33
12133 16 92 27
;
data California;
   input FIPS Age Population Deaths;
   SamplingWeight=2;
   datalines;
6001 4 17412 348
6001 5 72709 58
6001 6 101367 41
6001 7 95572 33
6001 8 89730 87
6001 9 107173 124

   ... more lines ...

6115 11 5421 11
6115 12 3720 34
6115 13 2766 58
6115 14 1752 77
6115 15 796 74
6115 16 180 39
;

Table 3 describes the different levels of the categorical variable Age.

Table 3: Age Categories

Age Category	Description
4	Less than 1 year
5	1–4 years
6	5–9 years
7	10–14 years
8	15–19 years
9	20–24 years
10	25–34 years
11	35–44 years
12	45–54 years
13	55–64 years
14	65–74 years
15	75–84 years
16	85+ years

The following SAS statements use the SURVEYMEANS procedure to estimate the crude mortality rates for Florida and California. The RATE= option in the PROC SURVEYMEANS statement identifies the sampling rate. The SURVEYMEANS procedure uses the sampling rate to compute a finite population correction for the Taylor series variance estimates. The RATIO and SUM keywords in the PROC SURVEYMEANS statement request estimates of the population ratios and totals, respectively. The VAR statement requests estimates of the variables Deaths and Population. The CLUSTER statement specifies that the variable FIPS identify the primary sampling units. The WEIGHT statement specifies that the variable SamplingWeight contain the sampling weights. The RATIO statement identifies the ratio of interest to be the number of deaths divided by the population size.

proc surveymeans data=Florida ratio sum rate=.5;
  cluster fips;
  weight SamplingWeight;
  var deaths population;
  ratio 'Florida Crude Mortality Rate' deaths/population;
run;

proc surveymeans data=California ratio sum rate=.5;
  cluster fips;
  weight SamplingWeight;
  var deaths population;
  ratio 'California Crude Mortality Rate' deaths/population;
run;

Output 3 and Output 4 show the estimation results.

Output 3: Crude Mortality Rate for Florida

The SURVEYMEANS Procedure

Data Summary
Number of Clusters	34
Number of Observations	442
Sum of Weights	871

Ratio Analysis: Florida Crude Mortality Rate
Numerator	Denominator	Ratio	Std Err
Deaths	Population	0.010774	0.000464

Output 4: Crude Mortality Rate for California

The SURVEYMEANS Procedure

Data Summary
Number of Clusters	29
Number of Observations	377
Sum of Weights	754

Ratio Analysis: California Crude Mortality Rate
Numerator	Denominator	Ratio	Std Err
Deaths	Population	0.007702	0.000595

The estimated crude mortality rates for Florida and California are 1.08% and 0.77%, respectively. The ratio of the crude mortality rates is 1.40. However, before you conclude that the mortality rate is higher in Florida than in California, consider the following two exhibits. Figure 4 shows that the age-specific mortality rates are decidedly a function of age in both states.

Figure 4: Age-Specific Crude Rates versus Age in Florida and California

Figure 5 shows that the populations in Florida and California exhibit different age distributions. The percentage of residents in the age groups 13, 14, and 15 is higher in Florida than in California, whereas the percentage of residents in the age groups 5, 6, 7, 8, 9, 10, and 11 is lower in Florida than in California. Together these facts indicate that the crude mortality rates are not an appropriate measure for comparing differences between these two populations (Curtin and Klein, 1995).

Figure 5: Estimated Age Distributions in Florida and California

Note: The SAS statements that generate Figure 4 and Figure 5 are not shown here but are included in the downloadable SAS program that is available with this web example.
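Age-specific rates like the ones plotted in Figure 4 can be estimated by treating the age groups as domains. The following statements are a sketch of one possible approach, not the code from the downloadable program: they add a DOMAIN statement to the crude-rate analysis for Florida so that the ratio is also estimated within each level of Age. The ratio label is made up for this illustration.

```sas
/* Estimate the mortality rate within each age group (domain) */
proc surveymeans data=Florida ratio rate=.5;
  cluster fips;
  weight SamplingWeight;
  var deaths population;
  domain age;
  ratio 'Florida Age-Specific Mortality Rate' deaths/population;
run;
```

The same statements with DATA=California would give the corresponding estimates for California.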

Because the crude rate is not appropriate, and because age-specific mortality rates provide too much detail and require a large number of comparisons, you can use a summary measure that controls for a population’s age distribution. A commonly used measure is the age-adjusted mortality rate, which you can compute by performing direct standardization (Curtin and Klein, 1995).

As mentioned earlier, direct standardization is mathematically equivalent to poststratification. The difference between poststratification for the purpose of performing direct standardization and other forms of poststratification is this: when you perform direct standardization, the poststratum totals or proportions represent a standard or reference population rather than the population from which your sample was drawn.

To compute comparable age-adjusted rates for Florida and California by using poststratification, you need a data set that contains the age distribution proportions from a standard or reference population. The following SAS statements create the data set USbyAge, which contains the age-specific proportions for the US population in 1968:

data USbyAge;
  input Age _PSPCT_;
  datalines;
 4 0.01755
 5 0.07291
 6 0.10231
 7 0.10202
 8 0.09116
 9 0.07545
10 0.11879
11 0.11822
12 0.11391
13 0.09065
14 0.06103
15 0.02980
16 0.00621
;
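Because the proportions in USbyAge are rounded, it can be worth confirming that they add up to approximately 1 before using them as poststratum proportions. The following step is a simple check of that kind:

```sas
/* Sum of the reference-population proportions; should be close to 1 */
proc means data=USbyAge sum;
  var _PSPCT_;
run;
```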

You can then use PROC SURVEYMEANS to compute age-adjusted mortality rates for Florida and California. The procedure specification in the following SAS statements is the same as when you compute the crude rates, except that you add a POSTSTRATA statement, which specifies poststratification on the variable Age, and the PSPCT= option, which specifies that the population proportions be contained in the data set USbyAge.

proc surveymeans data=Florida ratio rate=.5;
  cluster fips;
  weight SamplingWeight;
  var deaths population;
  poststrata age / pspct=USbyAge;
  ratio 'Florida Standardized Mortality Rate' deaths/population;
run;

proc surveymeans data=California ratio rate=.5;
  cluster fips;
  weight SamplingWeight;
  var deaths population;
  poststrata age / pspct=USbyAge;
  ratio 'California Standardized Mortality Rate' deaths/population;
run;

Output 5 and Output 6 show the estimation results. The age-adjusted mortality rates for Florida and California are 0.70% and 0.48%, respectively. The ratio of the age-adjusted mortality rates is 1.45. Therefore, on an age-adjusted basis, the mortality rate in Florida in 1968 is almost 1.5 times the mortality rate in California in the same year.

Output 5: Standardized Mortality Rate for Florida

The SURVEYMEANS Procedure

Data Summary
Number of Clusters	34
Number of Poststrata	13
Number of Observations	442
Sum of Weights	871

Ratio Analysis: Florida Standardized Mortality Rate
Numerator	Denominator	Ratio	Std Err
Deaths	Population	0.006952	0.000248

Output 6: Standardized Mortality Rate for California

The SURVEYMEANS Procedure

Data Summary
Number of Clusters	29
Number of Poststrata	13
Number of Observations	377
Sum of Weights	754

Ratio Analysis: California Standardized Mortality Rate
Numerator	Denominator	Ratio	Std Err
Deaths	Population	0.004791	0.000385
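The rate ratios quoted in the text can be reproduced from the reported estimates. The following DATA step is a small sketch that copies the crude rates from Output 3 and Output 4 and the standardized rates from Output 5 and Output 6; the data set name RateRatio and its variable names are chosen here for illustration.

```sas
/* Florida-to-California ratio of the estimated mortality rates */
data RateRatio;
  input Type :$8. Florida California;
  Ratio = Florida / California;
  datalines;
Crude 0.010774 0.007702
Adjusted 0.006952 0.004791
;

proc print data=RateRatio noobs;
  format Ratio 5.2;
run;
```

This computation gives only the point estimates of the ratios; it does not provide standard errors for them.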

References

Curtin, L. R. and Klein, R. J. (1995), “Direct Standardization (Age-Adjusted Death Rates),” Healthy People 2000: Statistical Notes, DHHS Publication No. (PHS) 95-1237.
Ingram, D. D. and Franco, S. J. (2012), “NCHS Urban-Rural Classification Scheme for Counties,” Vital and Health Statistics, Series 2: Data Evaluation and Methods Research no. 154, DHHS publication no. (PHS) 2012-1354.
Lehtonen, R. and Pahkinen, E. (2004), Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition, Chichester, UK: John Wiley & Sons.
Lohr, S. L. (2010), Sampling: Design and Analysis, 2nd Edition, Boston: Brooks/Cole.
Särndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.