Hello,
I am working with data collected through a survey that was implemented in a number of different countries. I have limited info on the survey's design so I am working with what I have and know (which is not much). I have sampling weights, which adjust for over- and under-sampling on a number of key demographics; their purpose is to make the sample representative of the respective countries' populations.
I have a Likert scale item that has missing values that I am trying to impute using proc surveyimpute. Below is my syntax, but I don't fully understand the ndonors and cells bit, so I want to make sure I am not doing anything stupid and inappropriate. What I 'imagine' the cells statement is doing is ensuring that if a missing value is for a respondent from Belgium, for example, then the donors will come from Belgium and not other countries. And what I 'imagine' ndonors is doing is telling SAS to randomly pick 5 values from which the imputed value will be picked? Please correct any misperception I have here. The official SAS documentation is too complex for me to understand.
proc surveyimpute data=data method=hotdeck(selection=abb)
                  ndonors=5 seed=1234;
   weight weight;
   cells country;
   var var1;
   output out=imputed;
run;
Your intuition is essentially correct. Let’s break down what each key option is doing:
CELLS statement (cells country;):
This statement defines the imputation cells based on the variable country. In practical terms, it means that the procedure will perform imputation within each country. So, if an observation from Belgium is missing a value, SAS will only consider donor observations that are also from Belgium. This is important for maintaining the within-country characteristics of the data.
NDONORS option (ndonors=5):
This option sets how many donor observations are selected from the cell for each missing value. In your example, SAS selects 5 donors (from within the same country) for each respondent whose var1 is missing. One point to be aware of: the procedure does not then pick a single final value from those 5. The output data set holds one imputed row per selected donor, so each respondent with a missing value appears several times, and any later analysis has to account for that.
Additional notes:
Method and selection:
You’re using the hotdeck method with selection=abb. The abb option (approximate Bayesian bootstrap) specifies the algorithm SAS uses to select donors from those available in the cell. Although the details of abb can be technical, its role is simply to govern how the NDONORS donors are drawn.
Seed option (seed=1234):
This ensures that the random selection of donors is reproducible. Every time you run the code with that seed on the same data, you should get the same set of imputed values.
In summary, your understanding is on target.
Just be sure that there are enough donor cases within each cell (country) so that the NDONORS=5 option can work as intended. If a country has very few respondents, SAS may not be able to find 5 donors, and it will use as many as are available.
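One way to check this before imputing is to count, per country, how many respondents have var1 observed (potential donors) and how many have it missing (recipients). A minimal sketch, assuming the data set and variable names from your syntax:

```sas
/* N     = respondents with var1 observed (potential donors)   */
/* NMISS = respondents with var1 missing (recipients)          */
proc means data=data n nmiss;
   class country;
   var var1;
run;
```

Any country where N is small relative to NDONORS=5 is worth a closer look.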
Hope this clarifies your questions!
You do not need to manually split the imputed dataset into five separate datasets. SAS automatically generates multiple rows (one for each imputation) and identifies them with the imputation indicator variable (_Imputation_). The correct approach is to use this indicator variable to run your analysis separately for each imputed dataset (for example, by using a BY _Imputation_ statement in your analysis procedure) and then pool the results with PROC MIANALYZE. This method properly accounts for both the within-imputation variability and the between-imputation variability, following Rubin's rules.
/*-----------------------------------------------------*/
/* Step 1: Create a Synthetic Dataset with Missing Data */
/*-----------------------------------------------------*/
data dummy;
   /* Initialize the random number generator for reproducibility */
   call streaminit(12345);
   do id = 1 to 500;
      /* Simulate a categorical variable 'country' with 4 countries */
      country = ceil(4 * rand("Uniform"));
      /* Simulate a weight variable (always positive) */
      weight = 1 + 2 * rand("Uniform");
      /* Generate an underlying Likert scale value (1 to 5) */
      var1_true = ceil(5 * rand("Uniform"));
      /* Create an outcome variable that depends on var1_true plus some noise */
      outcome = 10 + 2 * var1_true + rand("Normal");
      /* Introduce missingness in var1 at about 20% probability */
      if rand("Uniform") < 0.2 then
         var1 = .;
      else
         var1 = var1_true;
      output;
   end;
   drop var1_true id;
run;
/* Create a format to display country names */
proc format;
   value countryfmt
      1 = 'Belgium'
      2 = 'France'
      3 = 'Germany'
      4 = 'Netherlands';
run;
/* Apply the format to the country variable */
data dummy;
   set dummy;
   format country countryfmt.;
run;
/*-----------------------------------------------------*/
/* Step 2: Impute Missing Values using PROC SURVEYIMPUTE */
/*-----------------------------------------------------*/
/* Use the hot deck method with 5 candidate donors.
   The IMPINDEX= option creates an imputation indicator variable. */
proc surveyimpute data=dummy method=hotdeck(selection=abb)
                  ndonors=5 seed=1234;
   weight weight;
   cells country;
   var var1;
   output out=imputed impindex=_Imputation_;
run;
/* Optional: Check the contents of the imputed dataset to verify _Imputation_ exists */
proc contents data=imputed;
run;
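/* Optional sketch (an addition, not part of the original answer): tabulate the
   imputation index to see how many rows carry each value. This assumes the
   OUT= data set keeps the observed rows alongside the imputed ones. If the
   index turns out to be 0 for observed rows, a BY _Imputation_ analysis will
   treat them as their own group rather than as part of every imputation. */
proc freq data=imputed;
   tables _Imputation_;
run;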
/*-----------------------------------------------------*/
/* Step 3: Analyze Each Imputed Dataset Separately */
/*-----------------------------------------------------*/
/* Sort the imputed dataset by the imputation indicator variable */
proc sort data=imputed;
   by _Imputation_;
run;
/* Run a survey regression for each imputation */
proc surveyreg data=imputed;
   by _Imputation_;
   weight weight;
   model outcome = var1;
   ods output ParameterEstimates=est;
run;
/* Ensure that the estimates dataset is sorted by the parameter indicator */
proc sort data=est;
   by Parameter;
run;
/*-----------------------------------------------------*/
/* Step 4: Pool the Results Using PROC MIANALYZE */
/*-----------------------------------------------------*/
/* Pool the regression estimates across imputations according to Rubin’s rules */
proc mianalyze data=est;
   by Parameter;
   modeleffects Estimate;
   stderr StdErr;
run;
This example demonstrates a complete workflow for handling missing data with multiple imputations in SAS. The code performs the following steps:
Data Generation:
A synthetic dataset is created with a categorical variable (country), a weight variable, a Likert-scale variable (var1) with about 20% missing values, and an outcome variable (outcome).
Imputation:
PROC SURVEYIMPUTE imputes the missing values in var1.
The cells statement ensures that imputation is done separately by country, and the ndonors=5 option selects five candidate donors per missing value.
The IMPINDEX=_Imputation_ option creates an indicator variable (_Imputation_) that marks each of the five imputations.
Analysis by Imputation:
The imputed dataset is sorted by _Imputation_, and a survey regression (PROC SURVEYREG) is run separately for each imputation using a BY _Imputation_ statement.
The parameter estimates are saved in a dataset (est).
Pooling Results:
The estimates are sorted by the Parameter variable, and then pooled by processing each parameter separately.
Something isn't right with this. If I do the analysis with BY _Imputation_, it only runs the model on the imputed values: in this case, only 1000 obs as opposed to 4000, which is the total dataset (obs with non-missing n=3000 + obs with missing n=1000).
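One possible way around this, sketched under an assumption that should be checked against the actual output: if _Imputation_ is 0 for the rows where var1 was observed and 1 to 5 for the imputed rows, then each completed dataset can be built by copying the observed rows into every imputation before the BY-group analysis. The dataset name stacked is made up for illustration.

```sas
/* Assumption: _Imputation_ = 0 marks rows with var1 observed,
   and 1-5 marks the imputed rows for each recipient. */
data stacked;
   set imputed;
   if _Imputation_ = 0 then do _Imputation_ = 1 to 5;
      output;       /* copy each observed row into all five completed datasets */
   end;
   else output;     /* imputed rows already carry their imputation number */
run;

proc sort data=stacked;
   by _Imputation_;
run;

/* Same model as in the answer, now fitted on complete data per imputation */
proc surveyreg data=stacked;
   by _Imputation_;
   weight weight;
   model outcome = var1;
   ods output ParameterEstimates=est;
run;
```

With this layout each BY group should contain all 4000 respondents (3000 observed + 1000 imputed), and est can go through the same PROC SORT and PROC MIANALYZE steps as above. Whether the original weight or an imputation-adjusted weight is the right one depends on the survey design, so treat this as a starting point rather than a verified recipe.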
ChatGPT is often wrong. I think you used ChatGPT.