Melanie Dove, University of California, Davis; Katherine Heck, University of California, San Francisco
Population-based, representative surveys often incorporate complex methods in data collection, such as oversampling, weighting, stratification or clustering. If survey procedures are not used in data analyses, results will provide incorrect estimates and may overstate the results of significance testing. SAS procedures, such as PROC SURVEYFREQ and PROC SURVEYMEANS, make it easy to adjust for the complex sample design and weighting of representative surveys to obtain the correct percentages, confidence intervals, means, odds ratios and other statistics from complex survey data. In this introductory presentation, attendees will learn why it is necessary to use survey procedures when analyzing stratified or cluster sample surveys, what key survey design features are incorporated in the SAS code, and how to generate estimates using SAS survey procedures. This presentation will provide sample survey procedure code, and explain how to interpret the output from each survey procedure. Using examples from two publicly available surveys with different design elements (the National Health and Nutrition Examination Survey and the California Health Interview Survey), this presentation will demonstrate the following SAS survey procedures: PROC SURVEYFREQ, PROC SURVEYMEANS, PROC SURVEYLOGISTIC and PROC SURVEYREG.
Watch Survey Data Analysis Made Easy With SAS® as presented by the author on the SAS Users YouTube channel.
This paper will describe key survey design features, including sampling and weighting, and illustrate how to use SAS survey procedures to adjust for these features. Two publicly available health surveys will be used as examples in this paper: the National Health and Nutrition Examination Survey (NHANES) and the California Health Interview Survey (CHIS). Briefly, NHANES is a national in-person survey and CHIS is a California-based telephone survey. After discussing key survey design features, sample SAS code and output will be reviewed.
A survey is a sample of individuals that is selected to represent a population. Often, there are not enough resources to collect information on the entire population of interest, so a sample is selected to represent the overall population. Survey designs may incorporate information about how participants in the sample were selected, including stratification or clustering, and may include weights that help to make the sample representative of the target population. SAS survey procedures are used to adjust for these survey design features in order to provide accurate results.
Sampling indicates the method used to select participants from a target population. For example, NHANES selects approximately 5,000 participants per year from the entire US population and CHIS selects about 26,000 participants per year from the California population to be in the survey. This paper will discuss three different types of sampling: simple random sampling, stratified sampling, and cluster sampling.
With simple random sampling, participants are selected at random from the target population and are considered independent of each other. No design adjustment is needed in SAS for this type of sampling; analyses may use PROC FREQ and other non-survey procedures.
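As a minimal sketch of the simple random sampling case, the following builds a small made-up dataset and tabulates it with PROC FREQ, with no design statements. The dataset name srs_sample, the variable names, and all values are hypothetical and do not come from NHANES or CHIS.

```sas
/* Hypothetical toy data: a simple random sample of 8 adults.
   hypertension: 1 = yes, 0 = no (made-up values). */
data srs_sample;
   input id hypertension;
   datalines;
1 1
2 0
3 0
4 1
5 0
6 0
7 1
8 0
;
run;

/* No STRATA, CLUSTER, or WEIGHT statement is needed for a simple random sample */
proc freq data=srs_sample;
   tables hypertension;
run;
```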
With stratified sampling, the target population is divided into different groups (or strata) and participants are sampled from each group. This ensures that there are enough people in the sample from each group. For example, in CHIS, the target population (California) is divided into different geographic regions (or strata) and participants are randomly selected from each region. This ensures that there are participants from all areas of California in the sample. With stratified sampling, participants are not independent from each other, and the STRATA statement is used to adjust for the fact that individuals within strata are more similar to each other than to individuals from different strata. Ignoring the stratification typically leads to an overestimate of the variance, which can understate the significance of findings.
With cluster sampling, the target population is also divided into different groups, based on a shared characteristic, such as the school they attend or the county in which they live. The group itself (the cluster) is selected randomly, and then participants are drawn from that cluster. For example, in NHANES, counties are selected to be in the survey from a list of counties in the target population (United States), and participants are selected from each county or cluster. Similar to stratified sampling, participants are not independent of each other. The CLUSTER statement is used to adjust for the fact that individuals within clusters are more similar to each other than to individuals from different clusters. Ignoring the clustering typically leads to an underestimate of the variance, which can overstate the significance of findings.
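To illustrate how the design statements change an analysis, the sketch below contrasts a naive PROC MEANS with a design-based PROC SURVEYMEANS on a tiny made-up dataset. The dataset toy_design and the variables stratum, psu, wt, and sbp are hypothetical names chosen for illustration; real surveys supply their own design variables. The point estimate and standard error will generally differ between the two procedures.

```sas
/* Hypothetical toy data: 2 strata, 2 clusters (PSUs) per stratum,
   2 people per cluster. All names and values are made up. */
data toy_design;
   input stratum psu wt sbp;
   datalines;
1 1 100 120
1 1 100 124
1 2 150 130
1 2 150 128
2 1 200 118
2 1 200 122
2 2 250 135
2 2 250 131
;
run;

/* Naive analysis: treats the 8 people as an unweighted simple random sample */
proc means data=toy_design mean stderr;
   var sbp;
run;

/* Design-based analysis: accounts for strata, clusters, and weights */
proc surveymeans data=toy_design varmethod=taylor mean stderr;
   strata stratum;
   cluster psu;
   weight wt;
   var sbp;
run;
```

A dataset this small is only useful for seeing the syntax run; it is far too small to give meaningful estimates.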
A weight is a value indicating the number of people the respondent represents. Each person is given one or more weights so that the weighted estimates are representative of the target population. The SAS statement WEIGHT corrects for the following factors:
This paper will describe two methods of creating weights: single weights and replicate weights.
With single weights there is one weight per person. The weight represents the number of individuals in the population that the sampled person represents.
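As a rough sketch of where a weight comes from, the DATA step below computes a base weight as the population size divided by the sample size within each group. The group names and counts are made up, and the calculation assumes equal-probability selection within a group. Weights released with real surveys usually include further adjustments, so this should be read as an illustration and not as the NHANES or CHIS weighting method.

```sas
/* Hypothetical frame summary: population size and sample size by group.
   All names and numbers are made up. */
data base_weights;
   input group $ pop_size sample_size;
   /* Each sampled person represents pop_size / sample_size people:
      urban = 9000 / 30 = 300, rural = 1000 / 20 = 50 */
   base_weight = pop_size / sample_size;
   datalines;
urban 9000 30
rural 1000 20
;
run;

proc print data=base_weights noobs;
run;
```

In this toy case each sampled urban resident stands for 300 people and each sampled rural resident for 50, so an unweighted analysis would over-represent the rural group.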
With replicate weights, there is more than one weight per person. Often, each participant receives a base weight and an additional 80 replicate weights. Replicate weights may be used when there is a concern about confidentiality, since releasing a stratification variable may reveal information about participants. For example, in CHIS the target population is divided into different strata based on geography. Instead of releasing a strata variable with information on each participant’s geographic location, CHIS releases replicate weights that account for both the stratification and the weighting. With replicate weights, two SAS statements are used: WEIGHT to list the base weight and REPWEIGHT to list the replicate weights.
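To make the role of the replicate weights concrete, the jackknife variance estimate can be sketched as the formula below. The notation is introduced here for illustration and is not taken from the paper: $\hat{\theta}$ is the estimate computed with the base weight, $\hat{\theta}_r$ is the same estimate recomputed with the $r$-th replicate weight, $R$ is the number of replicate weights, and $\alpha_r$ is the jackknife coefficient for replicate $r$.

```latex
\hat{V}(\hat{\theta}) = \sum_{r=1}^{R} \alpha_r \left( \hat{\theta}_r - \hat{\theta} \right)^2
```

With the JKCOEFS=1 option used in the CHIS code later in this paper, every $\alpha_r$ equals 1, so the variance is the sum of squared differences between each replicate estimate and the full-sample estimate. The survey's own documentation should be consulted for the coefficients appropriate to a given set of replicate weights.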
All SAS survey procedures use common statements to identify survey design and weighting elements.
Additional statements are similar to those used in non-survey SAS procedures, such as TABLES, CLASS, CONTRAST, VAR, MODEL, and TEST; see SAS online documentation for options available in each procedure.
To demonstrate how to analyze survey data with SAS, this paper will use examples from NHANES and CHIS, which have different sampling and weighting methods, as outlined in Table 1 below.
| Survey | Sampling method | Weight method | SAS statements |
|---|---|---|---|
| NHANES | Stratification and clustering | Single weights | STRATA, CLUSTER, WEIGHT |
| CHIS | Stratification | Replicate weights | WEIGHT, REPWEIGHT |
Table 1. Example Surveys, Sampling and Weight Methods, and SAS Statements
Below are two general syntax examples for how to adjust for the survey design features, using NHANES and CHIS data.
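/* NHANES: stratification, clustering, and a single weight (Taylor series variance) */
proc surveyfreq data=datasetname varmethod=taylor;
   strata stratavariable;
   cluster clustervariable;
   weight weightvariable;
   tables variablename;
run;

/* CHIS: base weight plus replicate weights (jackknife variance) */
proc surveyfreq data=datasetname varmethod=jackknife;
   weight baseweightvariable;
   repweight replicateweightvariables / jkcoefs=1;
   tables variablename;
run;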
Below are more specific examples of SAS code and output for four different survey procedures: 1) PROC SURVEYFREQ, 2) PROC SURVEYMEANS, 3) PROC SURVEYLOGISTIC, and 4) PROC SURVEYREG. The strata, cluster, and weight variables below may change when using different years of data. Make sure to consult each survey’s documentation for the correct variables to include in the code.
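In the following example, CHIS data are used to estimate the percent of men and women (‘srsex’) with hypertension (‘ab29’). The DOMAIN statement is not available with the SURVEYFREQ procedure. Therefore, the gender variable is included as the first variable in the cross tab to get estimates by gender. Do not subset the data using a WHERE or BY statement, because removing observations from the analysis can produce incorrect variance estimates. The following options are specified on the TABLES statement: CL requests confidence limits for the percentages, ROW requests row percentages, NOTOTAL suppresses the row and column totals, and CHISQ requests the Rao-Scott chi-square test.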
proc surveyfreq data=CHIS varmethod=jackknife;
weight rakedw0;
repweight rakedw1-rakedw80 / jkcoefs=1;
tables srsex*ab29 / cl row nototal chisq;
run;
Output 1. Output from a PROC SURVEYFREQ Statement
From Output 1, 31.0% of men and 25.8% of women had hypertension. These estimates are significantly different, as indicated by the chi-square p-value of 0.0006.
The following example requests the mean number of times walked for leisure in the past week (‘ad41W’) for different age categories (‘srage_p1’), also using CHIS data. The continuous variable ‘number of times walked’ is included on the VAR statement. The DOMAIN statement is used for the categorical age variable.
proc surveymeans data=CHIS varmethod=jackknife;
weight rakedw0;
repweight rakedw1-rakedw80 / jkcoefs=1;
var ad41W;
domain SRAGE_P1;
run;
Output 2. Output from a PROC SURVEYMEANS Statement
From Output 2, the mean number of times walked for leisure in the past seven days was 3.0 times for 18-29 year-olds and 2.5 times for adults 70 years and older.
The following example examines the association between insurance status (‘uninsured’) and not having a usual source of healthcare (‘nousual’). The CLASS statement is used for the categorical variable ‘uninsured’ and the ‘Insured’ category is specified as the referent. On the model statement, the ‘(descending)’ option is used so that SAS estimates the odds that ‘nousual’ equals 1 (or does not have a usual source of healthcare) instead of 0 (has a usual source of healthcare).
proc surveylogistic data=CHIS varmethod=jackknife;
weight rakedw0;
repweight rakedw1-rakedw80 / jkcoefs=1;
class uninsured (ref='Insured') / param=ref;
model nousual (descending)=uninsured;
run;
Output 3. Output from a PROC SURVEYLOGISTIC Statement
From Output 3, the odds ratio is 5.2 with a 95% confidence interval of 3.7 to 7.3, suggesting that uninsured adults have about 5 times the odds of not having a usual source of healthcare, compared with insured adults.
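As a reminder of what the odds ratio measures, it can be written out as below. The symbols are introduced here for illustration: $p_1$ is the proportion of uninsured adults without a usual source of healthcare, $p_0$ is the same proportion among insured adults, and $\beta$ is the model coefficient for ‘uninsured’ under the reference parameterization requested with param=ref.

```latex
\mathrm{OR} = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)} = e^{\beta}
```

Because the odds ratio compares odds and not probabilities, it equals the ratio of proportions $p_1 / p_0$ only approximately, and only when the outcome is rare in both groups.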
Using the NHANES data, the following example examines the association between health insurance status (‘hi’) and cotinine (‘lbxcot’), a biomarker of nicotine measured in ng/mL. The DOMAIN statement is used with the variable ‘set’, a flag variable that equals 1 if in the analysis group and 0 otherwise. On the model statement, the option SOLUTION is included to get the beta estimates and the option CLPARM is used to get the 95% confidence intervals.
proc surveyreg data=NHANES varmethod=taylor;
strata sdmvstra;
cluster sdmvpsu;
weight wtint2yr;
domain set;
class hi;
model lbxcot=hi / solution clparm;
run;
Output 4. Output from a PROC SURVEYREG Statement
From Output 4, the average cotinine level for adults with Medicaid insurance is 28.2 ng/mL higher than that of uninsured adults. The average cotinine level for adults with private insurance is 38.6 ng/mL lower than that of uninsured adults.
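A flag variable like the one used on the DOMAIN statement above could be built as in the following sketch. It uses a tiny made-up dataset with NHANES-style variable names so that it runs on its own. The flag name in_analysis, the insurance codes, and the adult age cutoff are assumptions made for illustration; the paper does not say how its ‘set’ variable was defined.

```sas
/* Hypothetical toy data using NHANES-style variable names; values are made up.
   ridageyr = age in years; hi = insurance group (illustrative codes). */
data nhanes_toy;
   input sdmvstra sdmvpsu wtint2yr ridageyr hi $ lbxcot;
   datalines;
1 1 1000 34 Private 0.5
1 1 1200 12 Medicaid 0.1
1 2 900 45 None 50.2
1 2 1100 67 Medicaid 80.4
2 1 1500 29 None 120.0
2 1 1300 16 Private 0.2
2 2 1400 52 Private 1.1
2 2 1000 71 Medicaid 30.7
;
run;

/* Flag the analysis group instead of deleting rows, so that every
   stratum and cluster stays in the data for variance estimation.
   in_analysis is a hypothetical name; the age cutoff is an assumption. */
data nhanes_flagged;
   set nhanes_toy;
   in_analysis = (ridageyr >= 18);
run;

proc surveyreg data=nhanes_flagged varmethod=taylor;
   strata sdmvstra;
   cluster sdmvpsu;
   weight wtint2yr;
   domain in_analysis;
   class hi;
   model lbxcot = hi / solution clparm;
run;
```

The procedure reports results for the full sample and for each level of the flag; the rows where in_analysis equals 1 correspond to the analysis group. The toy data are far too small for meaningful estimates and only show the mechanics.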
The SAS survey procedures provide a flexible way to obtain accurate results of analyses using stratified or clustered survey data.
Centers for Disease Control and Prevention, National Center for Health Statistics. “NHANES Questionnaires, Datasets, and Related Documentation.” Accessed April 14, 2021. https://wwwn.cdc.gov/nchs/nhanes/Default.aspx.
UCLA Center for Health Policy Research, California Health Interview Survey. “Public Use Data.” Accessed April 14, 2021. https://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx.
Your comments and questions are valued and encouraged. Contact the authors at:
Melanie Dove
University of California, Davis
mdove@ucdavis.edu
Katherine Heck
University of California, San Francisco
katherine.heck@ucsf.edu