Statistical Procedures

K_S · Posted 01-30-2025 09:10 AM

Hello,

I am working with data collected through a survey that was implemented in a number of different countries. I have limited info on the survey's design so I am working with what I have and know (which is not much). I have sampling weights, which adjust for over- and under-sampling on a number of key demographics; their purpose is to make the sample representative of the respective countries' populations.

I have a Likert scale item with missing values that I am trying to impute using proc surveyimpute. Below is my syntax, but I don't fully understand the ndonors and cells parts, so I want to make sure I am not doing anything stupid or inappropriate. What I 'imagine' the cells statement is doing is ensuring that if a missing value is for a respondent from Belgium, for example, then the donors will come from Belgium and not other countries. And what I 'imagine' ndonors is doing is telling SAS to randomly pick 5 values from which the imputed value will be picked? Please correct any misperceptions I have here. The official SAS documentation is too complex for me to understand.

proc surveyimpute data=data method=hotdeck(selection=abb)
ndonors=5 seed=1234;
weight weight;
cells country;
var var1;
output out=imputed;
run;

webart999ARM · Posted 02-02-2025 06:18 AM

Your intuition is essentially correct. Let’s break down what each key option is doing:

CELLS statement (cells country;):
This statement defines the imputation cells based on the variable country. In practical terms, it means that the procedure will perform imputation within each country. So, if an observation from Belgium is missing a value, SAS will only consider donor observations that are also from Belgium. This is important for maintaining the within-country characteristics of the data.
NDONORS option (ndonors=5):
This option tells SAS how many potential donor observations to select from the specified cell for each missing value. In your example, SAS will randomly choose up to 5 candidate donors (from within the same country) for each missing value. Then, depending on the imputation method and settings, one of these candidate donors will be used to impute the missing value. In a hot deck imputation, typically one donor is ultimately selected at random from this pool to supply the imputed value.
Additional notes:
- Method and selection:
  You’re using the hotdeck method with selection=abb, where abb stands for approximate Bayesian bootstrap. The abb option specifies the algorithm SAS uses to select a donor from the available candidates. Although the details of abb can be technical, its role is simply to govern how the donor is chosen among those available (the NDONORS candidates).
- Seed option (seed=1234):
  This ensures that the random selection of donors is reproducible. Every time you run the code with that seed, you should get the same set of imputed values.
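
If you want to confirm the reproducibility on your own data, one option is to run the same step twice and compare the outputs. This is only a minimal sketch, assuming your data set and variables are named as in your post; imputed_a and imputed_b are illustrative names that I made up:

```sas
/* First run with a fixed seed */
proc surveyimpute data=data method=hotdeck(selection=abb)
    ndonors=5 seed=1234;
    weight weight;
    cells country;
    var var1;
    output out=imputed_a;
run;

/* Second run with the same input, options, and seed */
proc surveyimpute data=data method=hotdeck(selection=abb)
    ndonors=5 seed=1234;
    weight weight;
    cells country;
    var var1;
    output out=imputed_b;
run;

/* Compare the two output data sets row by row */
proc compare base=imputed_a compare=imputed_b;
run;
```

If the seed behaves as described, PROC COMPARE should report no unequal values between the two data sets.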

In summary, your understanding is on target:

CELLS ensures that imputation is carried out within the same country (or group defined by the variable).
NDONORS=5 means that for each missing observation, up to 5 donor cases will be randomly selected from that cell, and then one of them will provide the imputed value.

Just be sure that there are enough donor cases within each cell (country) so that the NDONORS=5 option can work as intended. If a country has very few respondents, SAS may not be able to find 5 donors, and it will use as many as are available.
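
A quick way to check this before imputing is to count the observed and missing values of the item in each country. A minimal sketch, assuming the data set and variable names from your post (data, country, var1); the column names n_donors and n_recipients are just labels I chose for illustration:

```sas
/* For each country, count potential donors (non-missing var1)
   and recipients (missing var1). */
proc sql;
   select country,
          count(var1) as n_donors,      /* COUNT(column) ignores missing values */
          nmiss(var1) as n_recipients   /* NMISS counts the missing values */
   from work.data
   group by country
   order by country;
quit;
```

Any country where n_donors is small relative to ndonors=5 deserves a closer look.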

Hope this clarifies your questions!

K_S · Posted 02-02-2025 09:20 AM

Thank you so much for your response—I really appreciate it. I also noticed that after imputation, SAS creates a data set with the imputed values, and for each observation that was imputed, there are five rows.

Would the correct approach be to separate these into five different data sets and then conduct the analysis five times? Since SAS seems to generate five imputations for each observation with a missing value, I wanted to confirm whether the appropriate method is to create separate data sets for each imputation before proceeding with the analysis.

I appreciate your guidance on this. Thank you again for your help!

webart999ARM · Posted 02-02-2025 10:06 AM

You do not need to manually split the imputed dataset into five separate datasets. SAS automatically generates multiple rows (one for each imputation) using the imputation indicator variable (_Imputation_). The correct approach is to use this indicator variable to run your analysis separately for each imputed dataset (for example, by using a BY _Imputation_ statement in your analysis procedure) and then pool the results with PROC MIANALYZE. This method properly accounts for both the within-imputation variability and the between-imputation variability, following Rubin's rules.

/*-----------------------------------------------------*/
/* Step 1: Create a Synthetic Dataset with Missing Data */
/*-----------------------------------------------------*/
data dummy;
   /* Initialize the random number generator for reproducibility */
   call streaminit(12345);
   do id = 1 to 500;
      /* Simulate a categorical variable 'country' with 4 countries */
      country = ceil(4 * rand("Uniform"));
      
      /* Simulate a weight variable (always positive) */
      weight = 1 + 2 * rand("Uniform");
      
      /* Generate an underlying Likert scale value (1 to 5) */
      var1_true = ceil(5 * rand("Uniform"));
      
      /* Create an outcome variable that depends on var1_true plus some noise */
      outcome = 10 + 2 * var1_true + rand("Normal");
      
      /* Introduce missingness in var1 at about 20% probability */
      if rand("Uniform") < 0.2 then 
         var1 = .;
      else 
         var1 = var1_true;
      
      output;
   end;
   drop var1_true id;
run;

/* Create a format to display country names */
proc format;
   value countryfmt
       1 = 'Belgium'
       2 = 'France'
       3 = 'Germany'
       4 = 'Netherlands';
run;

/* Apply the format to the country variable */
data dummy;
   set dummy;
   format country countryfmt.;
run;

/*-----------------------------------------------------*/
/* Step 2: Impute Missing Values using PROC SURVEYIMPUTE  */
/*-----------------------------------------------------*/
/* Use the hot deck method with 5 candidate donors.
   The IMPINDEX= option creates an imputation indicator variable. */
proc surveyimpute data=dummy method=hotdeck(selection=abb)
    ndonors=5 seed=1234;
    weight weight;
    cells country;
    var var1;
    output out=imputed impindex=_Imputation_;
run;

/* Optional: Check the contents of the imputed dataset to verify _Imputation_ exists */
proc contents data=imputed; 
run;

/*-----------------------------------------------------*/
/* Step 3: Analyze Each Imputed Dataset Separately      */
/*-----------------------------------------------------*/
/* Sort the imputed dataset by the imputation indicator variable */
proc sort data=imputed;
   by _Imputation_;
run;

/* Run a survey regression for each imputation */
proc surveyreg data=imputed;
   by _Imputation_;
   weight weight;
   model outcome = var1;
   ods output ParameterEstimates=est;
run;

/* Ensure that the estimates dataset is sorted by the parameter indicator */
proc sort data=est;
   by Parameter;
run;

/*-----------------------------------------------------*/
/* Step 4: Pool the Results Using PROC MIANALYZE        */
/*-----------------------------------------------------*/
/* Pool the regression estimates across imputations according to Rubin’s rules */
proc mianalyze data=est;
   by Parameter;
   modeleffects Estimate;
   stderr StdErr;
run;

This example demonstrates a complete workflow for handling missing data with multiple imputations in SAS. The code performs the following steps:

Data Generation:
- Creates a synthetic dataset with 500 observations, including a categorical variable (country), a weight variable, a Likert-scale variable (var1) with about 20% missing values, and an outcome variable (outcome).
Imputation:
- Uses PROC SURVEYIMPUTE with the hot deck method to impute missing values in var1.
- The cells statement ensures that imputation is done separately by country, and the ndonors=5 option selects five candidate donors per missing value.
- The IMPINDEX=_Imputation_ option creates an indicator variable (_Imputation_) that marks each of the five imputations.
Analysis by Imputation:
- Sorts the imputed data by _Imputation_ and runs a survey regression (PROC SURVEYREG) separately for each imputation using a BY _Imputation_ statement.
- The regression estimates (including standard errors) are output to a dataset (est).
Pooling Results:
- The estimates are then pooled using PROC MIANALYZE.
- Since the ODS output from PROC SURVEYREG is in long format (with one record per parameter per imputation), the data are first sorted by the Parameter variable, and then pooled by processing each parameter separately.
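
Before relying on the BY-group analysis, it may be worth checking how the rows of the output data set are distributed across the imputation index. A minimal sketch, assuming the PROC SURVEYIMPUTE step above ran as written and created the variable _Imputation_ in the data set imputed:

```sas
/* Count rows per value of the imputation index, including missing values.
   If the observed (non-imputed) rows carry a missing or different index
   value, they form their own BY group and are not included in the
   BY groups of the imputed rows. */
proc freq data=imputed;
   tables _Imputation_ / missing;
run;

/* Total number of rows in the output data set,
   for comparison with the number of rows in the input data set */
proc sql;
   select count(*) as n_rows
   from imputed;
quit;
```

If the counts per index value are much smaller than the size of the original data set, the BY-group models are not being fit on the full sample.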

K_S · Posted 02-03-2025 09:15 AM

Something ain't right with this. If I do the analysis with by _imputation_, then it only runs the model on the imputed values: in this case, only 1000 obs as opposed to 4000, which is the total dataset (obs with non-missing n=3000 + obs with missing n=1000).

ChatGPT is often wrong. I think you used ChatGPT.
