Solved: Clarification on data imputation for missing values

pvareschi · Posted 03-13-2020 04:34 AM

Topic "Dealing with Missing Values" in "Lesson 5: Regression Models Using SAS Enterprise Miner" (module 3: Predictive Modeler using SAS Enterprise Miner), with regard to synthetic distribution methods, states that: "A model trained with the modified training data will not be biased if the same modifications are made to any other data set that the model might encounter (and the data has a similar pattern of missing values)".

I do not understand, from a statistical point of view, why data imputation would not cause bias; would it not be more correct to say that bias will be introduced but it would be the same across training and validation datasets? In terms of net effect, I guess it would be negligible for ranking and decision predictions but it would be a concern for estimate predictions: is that correct?

Additionally, how should we choose whether to use the mean or median as the replacement value?

Would it be correct to say that for modelling techniques based on the ranking of input values the median would be more suitable, whereas for methods based on the actual measurements (e.g. regression) the mean should be preferred?

Cynthia_sas · Posted 03-16-2020 09:24 PM

Hi:

We asked the class instructors and here's the response:

"Imputation of numeric inputs in a regression framework affects the estimated regression coefficients. For right-skewed data, the median is typically less than the mean; for left-skewed data, the median is typically greater than the mean.

For simple regression with only main effects, you can anticipate how the regression coefficient will be influenced by the choice of mean or median. If a certain percentage of an input variable's values are smaller in a first data set than in a second data set, then for a fixed regression coefficient the predictions will be smaller (positive coefficient) or larger (negative coefficient) for the first data set. To bring the predictions closer in line, estimating the coefficient separately for each data set will produce a larger negative coefficient or a smaller positive coefficient for the first data set. For more complex models, like neural networks, it is difficult to assess the changes in the model brought about by switching from mean to median imputation.
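As a rough illustration of the point above (a sketch, not part of the instructors' reply and not SAS Enterprise Miner code), the following Python snippet fits a one-input regression twice on synthetic data, once after mean imputation and once after median imputation. The data-generating process (a lognormal, right-skewed input, 30% of values missing completely at random) and the helper names `ols_slope` and `impute` are assumptions made for the example.

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares slope of a one-input regression y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def impute(values, fill):
    """Replace missing entries (None) with a single fill value."""
    return [fill if v is None else v for v in values]


rng = random.Random(0)

# Assumed synthetic data: right-skewed input, linear target with true slope 1.5.
x_full = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
y = [2.0 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x_full]

# Assumed missingness: about 30% of inputs removed completely at random.
x_obs = [None if rng.random() < 0.3 else xi for xi in x_full]

observed = [v for v in x_obs if v is not None]
mean_fill = statistics.fmean(observed)
median_fill = statistics.median(observed)

slope_mean = ols_slope(impute(x_obs, mean_fill), y)
slope_median = ols_slope(impute(x_obs, median_fill), y)

print(f"fill values: mean={mean_fill:.3f}, median={median_fill:.3f}")
print(f"slopes:      mean={slope_mean:.3f}, median={slope_median:.3f}")
```

Because the input is right-skewed, the median fill value is below the mean fill value, so the two imputed data sets differ in exactly the way the paragraph describes, and the fitted slopes can be compared directly. Other missingness patterns or models may behave differently.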

In general, models derived using mean or median tend to perform about the same. With the model comparison node, you can pick the imputation method that gives the best result for the validation data, regardless of any intuition or rigorous theory.

With respect to how to choose mean or median imputation, do as stated above: pick the imputation method that gives the best result for the validation data."
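The advice to pick the method that does best on validation data can be sketched outside Enterprise Miner as well. The Python example below is a minimal illustration under assumed synthetic data: the fill value is computed from the training data only and then applied unchanged to the validation data (the "same modifications" mentioned in the course text), and the method with the lower validation mean squared error is selected. The functions `make_data`, `fit_line`, `impute`, and `validation_mse` are hypothetical helpers written for this sketch.

```python
import random
import statistics


def make_data(rng, n, missing_rate=0.3):
    """Assumed synthetic data: skewed input, linear target, random missingness."""
    x_full = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
    y = [2.0 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x_full]
    x_obs = [None if rng.random() < missing_rate else xi for xi in x_full]
    return x_obs, y


def fit_line(x, y):
    """Return (intercept, slope) of a least-squares line."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def impute(values, fill):
    """Replace missing entries (None) with a single fill value."""
    return [fill if v is None else v for v in values]


def validation_mse(train, valid, stat):
    """Fit on imputed training data, score on identically imputed validation data."""
    x_tr, y_tr = train
    x_va, y_va = valid
    # The fill value comes from the training data only.
    fill = stat([v for v in x_tr if v is not None])
    intercept, slope = fit_line(impute(x_tr, fill), y_tr)
    preds = [intercept + slope * xi for xi in impute(x_va, fill)]
    return statistics.fmean([(p - yi) ** 2 for p, yi in zip(preds, y_va)])


rng = random.Random(1)
train = make_data(rng, 1500)
valid = make_data(rng, 500)

methods = {"mean": statistics.fmean, "median": statistics.median}
scores = {name: validation_mse(train, valid, stat) for name, stat in methods.items()}
best = min(scores, key=scores.get)

print(scores)
print("selected imputation method:", best)
```

Which method wins depends on the data and the model, so the comparison would need to be repeated whenever either changes.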

And, a final comment: "From personal experience, I almost always pick the mean. (1) It is cheaper to compute and thus uses fewer computer cycles; (2) Rarely will the choice of mean or median make a substantial difference in prediction accuracy; (3) For tight deadlines, I am better served spending my time doing feature engineering than worrying about imputation methods. To choose the imputation method by experiment, I would have to re-run the comparison after every change I made to the model, because the accuracy of an imputation method might be influenced by how the imputed variables are correlated with other variables in the model."

Hope this helps,

Cynthia
