Hello,
I am working with data collected through a survey that was implemented in a number of different countries. I have limited info on the survey's design so I am working with what I have and know (which is not much). I have sampling weights, which adjust for over- and under-sampling on a number of key demographics; their purpose is to make the sample representative of the respective countries' populations.
I have a Likert scale item with missing values that I am trying to impute using proc surveyimpute. Below is my syntax, but I don't fully understand the ndonors and cells bit, so I want to make sure I am not doing anything stupid or inappropriate. What I 'imagine' the cells statement is doing is ensuring that if a missing value is for a respondent from Belgium, for example, then the donors will come from Belgium and not from other countries. And what I 'imagine' ndonors is doing is telling SAS to randomly pick 5 values from which the imputed value will be picked? Please correct any misperception I have here. The official SAS documentation is too complex for me to understand.
proc surveyimpute data=data method=hotdeck(selection=abb)
                  ndonors=5 seed=1234;
   weight weight;
   cells country;
   var var1;
   output out=imputed;
run;
Accepted Solutions
Your intuition is essentially correct, with one adjustment for NDONORS. Let’s break down what each key option is doing:
- CELLS statement (cells country;): This statement defines the imputation cells based on the variable country. In practical terms, it means that the procedure will perform imputation within each country. So, if an observation from Belgium is missing a value, SAS will only consider donor observations that are also from Belgium. This is important for maintaining the within-country characteristics of the data.
- NDONORS option (ndonors=5): This option tells SAS how many donor observations to select from the cell for each missing value. In your example, SAS selects 5 donors (from within the same country) for each observation with a missing value. Each selected donor supplies one imputed value, so the recipient is not given a single value picked from the five; instead it appears in the output data set once per donor.
Additional notes:
- Method and selection: You’re using the hotdeck method with selection=abb. The abb option requests the approximate Bayesian bootstrap: within each cell, SAS first resamples the donors with replacement and then draws the donors for the recipients from that resampled pool.
- Seed option (seed=1234): This ensures that the random selection of donors is reproducible. Every time you run the code with that seed on the same data, you should get the same set of imputed values.
In summary:
- CELLS ensures that imputation is carried out within the same country (or group defined by the variable).
- NDONORS=5 means that for each observation with a missing value, 5 donor cases are randomly selected from that cell, and each of them provides one imputed value.
Just be sure that there are enough donor cases within each cell (country) so that the NDONORS=5 option can work as intended. If a country has very few respondents with observed values, the five donors for a recipient cannot differ much from one another, and the imputations in that cell will rest on a handful of cases.
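One way to check the donor supply per cell before imputing is a simple count of observed and missing values by country. This is only a sketch: it reuses the dataset and variable names from the question (data, country, var1) and assumes var1 is stored as a numeric variable.

```sas
/* Per country: N     = observed values of var1 (potential donors),
                NMISS = missing values of var1 (recipients).
   Dataset and variable names are taken from the question;
   var1 is assumed to be numeric. */
proc means data=data n nmiss;
   class country;
   var var1;
run;
```

Countries where N is small relative to NMISS are the cells to look at more closely.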
Hope this clarifies your questions!
Would the correct approach be to separate these into five different data sets and then conduct the analysis five times? Since SAS seems to generate five imputations for each observation with a missing value, I wanted to confirm whether the appropriate method is to create separate data sets for each imputation before proceeding with the analysis.
I appreciate your guidance on this. Thank you again for your help!
You do not need to manually split the imputed dataset into five separate datasets. SAS automatically writes multiple rows (one for each imputation) and identifies them with the imputation indicator variable (_Imputation_). The correct approach is to use this indicator variable to run your analysis separately for each imputed dataset (for example, by using a BY _Imputation_ statement in your analysis procedure) and then pool the results with PROC MIANALYZE. This method properly accounts for both the within-imputation variability and the between-imputation variability, following Rubin's rules.
/*-----------------------------------------------------*/
/* Step 1: Create a Synthetic Dataset with Missing Data */
/*-----------------------------------------------------*/
data dummy;
   /* Initialize the random number generator for reproducibility */
   call streaminit(12345);
   do id = 1 to 500;
      /* Simulate a categorical variable 'country' with 4 countries */
      country = ceil(4 * rand("Uniform"));
      /* Simulate a weight variable (always positive) */
      weight = 1 + 2 * rand("Uniform");
      /* Generate an underlying Likert scale value (1 to 5) */
      var1_true = ceil(5 * rand("Uniform"));
      /* Create an outcome variable that depends on var1_true plus some noise */
      outcome = 10 + 2 * var1_true + rand("Normal");
      /* Introduce missingness in var1 at about 20% probability */
      if rand("Uniform") < 0.2 then
         var1 = .;
      else
         var1 = var1_true;
      output;
   end;
   drop var1_true id;
run;

/* Create a format to display country names */
proc format;
   value countryfmt
      1 = 'Belgium'
      2 = 'France'
      3 = 'Germany'
      4 = 'Netherlands';
run;

/* Apply the format to the country variable */
data dummy;
   set dummy;
   format country countryfmt.;
run;

/*-----------------------------------------------------*/
/* Step 2: Impute Missing Values using PROC SURVEYIMPUTE */
/*-----------------------------------------------------*/
/* Use the hot deck method with 5 candidate donors.
   The IMPINDEX= option creates an imputation indicator variable. */
proc surveyimpute data=dummy method=hotdeck(selection=abb)
                  ndonors=5 seed=1234;
   weight weight;
   cells country;
   var var1;
   output out=imputed impindex=_Imputation_;
run;

/* Optional: Check the contents of the imputed dataset to verify _Imputation_ exists */
proc contents data=imputed;
run;

/*-----------------------------------------------------*/
/* Step 3: Analyze Each Imputed Dataset Separately */
/*-----------------------------------------------------*/
/* Sort the imputed dataset by the imputation indicator variable */
proc sort data=imputed;
   by _Imputation_;
run;

/* Run a survey regression for each imputation */
proc surveyreg data=imputed;
   by _Imputation_;
   weight weight;
   model outcome = var1;
   ods output ParameterEstimates=est;
run;

/* Ensure that the estimates dataset is sorted by the parameter indicator */
proc sort data=est;
   by Parameter;
run;

/*-----------------------------------------------------*/
/* Step 4: Pool the Results Using PROC MIANALYZE */
/*-----------------------------------------------------*/
/* Pool the regression estimates across imputations according to Rubin’s rules */
proc mianalyze data=est;
   by Parameter;
   modeleffects Estimate;
   stderr StdErr;
run;
This example demonstrates a complete workflow for handling missing data with multiple imputations in SAS. The code performs the following steps:
- Data Generation:
  - Creates a synthetic dataset with 500 observations, including a categorical variable (country), a weight variable, a Likert-scale variable (var1) with about 20% missing values, and an outcome variable (outcome).
- Imputation:
  - Uses PROC SURVEYIMPUTE with the hot deck method to impute missing values in var1.
  - The cells statement ensures that imputation is done separately by country, and the ndonors=5 option selects five candidate donors per missing value.
  - The IMPINDEX=_Imputation_ option creates an indicator variable (_Imputation_) that marks each of the five imputations.
- Analysis by Imputation:
  - Sorts the imputed data by _Imputation_ and runs a survey regression (PROC SURVEYREG) separately for each imputation using a BY _Imputation_ statement.
  - The regression estimates (including standard errors) are output to a dataset (est).
- Pooling Results:
  - The estimates are then pooled using PROC MIANALYZE.
  - Since the ODS output from PROC SURVEYREG is in long format (with one record per parameter per imputation), the data are first sorted by the Parameter variable, and then pooled by processing each parameter separately.
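For reference, the pooling that Rubin's rules prescribe can be written out as follows. The notation is introduced here for illustration and does not come from the code: m is the number of imputations (5 in this example), \hat{Q}_i is the Estimate from imputation i, and U_i is the square of its StdErr.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i
  && \text{(pooled estimate)} \\
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m} U_i
  && \text{(within-imputation variance)} \\
B &= \frac{1}{m-1}\sum_{i=1}^{m} \bigl(\hat{Q}_i - \bar{Q}\bigr)^2
  && \text{(between-imputation variance)} \\
T &= \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr) B
  && \text{(total variance of } \bar{Q}\text{)}
\end{align*}
```

The pooled standard error is the square root of T, so it grows when the five per-imputation estimates disagree with one another.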
Something ain't right with this. If I do the analysis with by _imputation_, then it only runs the model on the imputed values: in this case, only 1000 obs as opposed to 4000, which is the total dataset (obs with non-missing n=3000 + obs with missing n=1000).
ChatGPT is often wrong. I think you used ChatGPT.
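A quick way to see what the BY groups actually contain is to tabulate the index variable in the output data set. This is a sketch only: it assumes the data set imputed and the variable _Imputation_ exist exactly as named in the code posted above, which in turn depends on the OUTPUT statement in that code behaving as described there.

```sas
/* Count rows per value of the imputation index in the OUT= data set.
   MISSING keeps rows with a missing index in the table as their own level. */
proc freq data=imputed;
   tables _Imputation_ / missing;
run;
```

A BY statement fits one model per distinct value in this table. If the observed rows all share one index value and the imputed rows are spread over the others, each BY group sees only part of the data, which would be consistent with the row counts reported above.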