Sample weights, strata, and cluster for complex surveys

Abrefah · Posted 12-03-2024 05:19 AM

I am analyzing HINTS 5 Cycles 3 and 4, which I stacked to get 9,303 observations. According to the documentation, the weight and variance-estimation variables for each cycle are:

Cycle 4:

strata VAR_STRATUM;
cluster VAR_CLUSTER;
weight PERSON_FINWT0;

Cycle 3:

strata VAR_STRATUM;
cluster VAR_CLUSTER;
weight NWGT0;

Because the weight variable has a different name in each cycle, I created a common weight variable for the two (final_weight).

Also, there was no unique identifier for the HINTS data, so I created one using the household IDs and then I merged the datasets.

Below is how I applied the code up to this point:

/* Step 1: Recode HeardHPVVaccine2 and Gender and Include Other Variables for Cycle 3 */
data cycle3_recoded;
    set tmp1.hints5_cycle3_public;

    /* Recode Gender */
    if GenderC = 1 then Gender = 1; /* Male */
    else if GenderC = 2 then Gender = 2; /* Female */
    else Gender = .; /* Missing values */

    /* Recode HeardHPVVaccine2 */
    if SEEKCAN = 1 then HeardHPVVaccine2 = 1; /* Yes */
    else if SEEKCAN = 2 then HeardHPVVaccine2 = 2; /* No */
    else if SEEKCAN in (-9, -7) then HeardHPVVaccine2 = .; /* Missing values */

    /* Rename weight variable */
    final_weight = NWGT0;

    /* Add survey cycle identifier */
    surveycycle = 3;
run;

/* Step 2: Recode HeardHPVVaccine2 and Gender and Include Other Variables for Cycle 4 */
data cycle4_recoded;
    set tmp2.hints5_cycle4_public;

    /* Recode Gender */
    if SelfGender = 1 then Gender = 1; /* Male */
    else if SelfGender = 2 then Gender = 2; /* Female */
    else Gender = .; /* Missing values */

    /* Recode HeardHPVVaccine2 */
    if ELECTRO = 1 then HeardHPVVaccine2 = 1; /* Yes */
    else if ELECTRO = 2 then HeardHPVVaccine2 = 2; /* No */
    else if ELECTRO = -9 then HeardHPVVaccine2 = .; /* Missing values */

    /* Rename weight variable */
    final_weight = PERSON_FINWT0;

    /* Add survey cycle identifier */
    surveycycle = 4;
run;

/* Step 3: Stack the Two Cycles */
data stacked;
    set cycle3_recoded cycle4_recoded;
    newID = CATS(HHID, surveycycle);
run;

/* Step 5: Extract Variables Needed for the Research Question */
data selected;
    set stacked;
    keep RaceEthn5 AgeGrpA EducA MaritalStatus HeardHPV HPVCauseCancer_Cervical HeardHPVVaccine2 ExplainedClearly SpentEnoughTime Gender final_weight VAR_STRATUM VAR_CLUSTER;
run;

After the steps above, I ran PROC MEANS to look at the combined sampling weight (final_weight):

proc means data=recode1 n min mean max sum;
var final_weight;
run;

The sum of the weights was far too high (1,010,026,682), greater than the US population (335,893,238). Will the strata and cluster variables have the same problem?

What should I do?

ballardw · Posted 12-04-2024 01:13 PM

I am not familiar with HINTS data in any form. If each cycle was weighted to a population total, as seems likely, then I would expect most methods of renaming variables and combining to give a "population" estimate of roughly N times the population, where N is the number of cycles, assuming no extreme changes in the population between cycles.
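One way to check whether that is what happened is to sum the weights separately for each cycle. This is only a sketch: it assumes the dataset stacked and the variables surveycycle and final_weight from the original post, and it uses stacked rather than selected because the keep statement in Step 5 drops surveycycle.

```sas
/* Sketch: count and sum of the common weight within each cycle.
   Assumes `stacked`, `surveycycle`, and `final_weight` exist as in the original post. */
proc means data=stacked n sum;
    class surveycycle;
    var final_weight;
run;
```

If each cycle's sum is close to a population total on its own, the combined sum is simply those totals added together.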

One basic approach is to scale each cycle to a proportion of a chosen population total.

With BRFSS data, a large scale complex survey I have worked with, one approach is:

To combine multiple years of Behavioral Risk Factor Surveillance System (BRFSS) data, you can adjust the weight variable proportionally based on the sample sizes for each year:

    1. Determine the sample size for each year
    2. Add the sample sizes together
    3. Calculate the proportion for each year by dividing the sample size for that year by the total sample size
    4. Adjust the weight for each year by multiplying the original weight by the proportion for that year
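
Applied to the stacked HINTS data, those steps could be sketched roughly as below. This is an illustration rather than a method taken from the HINTS documentation. It assumes the dataset stacked with surveycycle and final_weight from the original post; the names cycle_n, cycle_prop, adj_weight, and stacked_adj are made up for the example.

```sas
proc sql;
    /* Sample size for each cycle (hypothetical table name cycle_n) */
    create table cycle_n as
    select surveycycle, count(*) as cycle_n
    from stacked
    group by surveycycle;

    /* Proportion of the combined sample in each cycle, then the adjusted weight */
    create table stacked_adj as
    select s.*,
           c.cycle_n / (select count(*) from stacked) as cycle_prop,
           s.final_weight * calculated cycle_prop as adj_weight
    from stacked as s
         inner join cycle_n as c
         on s.surveycycle = c.surveycycle;
quit;

/* Compare the original and adjusted weight sums by cycle */
proc means data=stacked_adj n sum;
    class surveycycle;
    var final_weight adj_weight;
run;
```

Because the proportions add up to 1, the adjusted weights sum to a sample-size-weighted average of the per-cycle totals instead of their sum. The adjusted weight would then replace final_weight on the weight statement.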
