Solved: Re: Cluster-mean imputation

pvareschi · Posted 05-30-2020 01:06 AM

Re: Predictive Modeling Using Logistic Regression

In cluster-mean imputation (page 3-11 and appendix B-7 of the course text), should the variables used to define the clusters be restricted to those with missing values, or can they include all variables (i.e., those with and without missing values)?

Moreover, would it be possible to clarify the statement at the bottom of page 3-11 of the course text: “A simpler but sometimes useful alternative is to define a priori segments (for example, high, middle, low and unknown income) and then do mean or median imputation within each segment”?

I am not sure I understand the benefit of creating these segments, although I do understand how the example on page 3-12 works. Is the wording on page 3-11 correct?

sasmlp · Posted 06-01-2020 01:26 PM

PROC FASTCLUS can replace missing values with cluster means computed on the training data set. Split the data into training and validation data sets, run PROC FASTCLUS on the training data set to compute the cluster means, and save those means to an output data set. Then run PROC FASTCLUS again to replace the missing values in the validation data set with the cluster means from the training data set. Use all the variables to define the clusters (even variables with missing values). Specify the IMPUTE option, which requests imputation of missing values after the final assignment of observations to clusters.
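As a rough illustration of the second step (applying training cluster means to validation data), here is a minimal sketch in plain Python. It is not SAS code and does not reproduce PROC FASTCLUS. The seed values, the variable names, the use of `None` for a missing value, and the choice to measure distance over observed coordinates only are all assumptions made for the example.

```python
def nearest_seed(row, seeds):
    """Return the index of the closest seed, using only the values present in row.

    Assumes every row has at least one non-missing value.
    """
    def dist(seed):
        pairs = [(x, s) for x, s in zip(row, seed) if x is not None]
        # Mean squared difference over observed coordinates, so rows with
        # different numbers of missing values remain comparable.
        return sum((x - s) ** 2 for x, s in pairs) / len(pairs)
    return min(range(len(seeds)), key=lambda k: dist(seeds[k]))


def impute_from_seeds(rows, seeds):
    """Fill each missing value with the matching coordinate of the nearest seed."""
    out = []
    for row in rows:
        seed = seeds[nearest_seed(row, seeds)]
        out.append([s if x is None else x for x, s in zip(row, seed)])
    return out


# Hypothetical cluster means computed on the TRAINING data: (age, income in $1000s).
train_seeds = [[25.0, 30.0], [50.0, 90.0]]

# Validation rows; None marks a missing value.
valid = [[27.0, None], [None, 85.0], [48.0, 95.0]]

valid_imputed = impute_from_seeds(valid, train_seeds)
print(valid_imputed)
```

The point of the sketch is that the validation rows never influence the cluster means: they are only assigned to the nearest training cluster and filled from it. Because the assignment is distance-based, variables on very different scales would normally be standardized first.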

If you use PROC STDIZE without a BY statement, the imputation is unconditional. Sometimes it helps to define a priori segments and impute conditionally within them. For example, suppose income is related to education. Instead of an unconditional mean imputation of income, a conditional mean imputation of income within each education level (high, middle, and low) might give more reasonable values for income. This might improve the predictive accuracy of the model, especially if income is related to the target.
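The difference between unconditional and conditional mean imputation can be sketched as follows. This is plain Python rather than SAS, and the data values, field names, and helper functions are hypothetical, chosen only to mirror the income and education example.

```python
from collections import defaultdict
from statistics import mean


def segment_means(rows, segment_key, value_key):
    """Mean of value_key within each segment, ignoring missing values."""
    groups = defaultdict(list)
    for r in rows:
        if r[value_key] is not None:
            groups[r[segment_key]].append(r[value_key])
    return {seg: mean(vals) for seg, vals in groups.items()}


def impute_by_segment(rows, means, segment_key, value_key):
    """Replace a missing value_key with the training mean of the row's segment.

    Assumes every segment seen here also appeared in the training data.
    """
    return [
        {**r, value_key: means[r[segment_key]]} if r[value_key] is None else dict(r)
        for r in rows
    ]


# Hypothetical training data: income in $1000s, None marks a missing value.
train = [
    {"education": "low", "income": 20.0},
    {"education": "low", "income": 30.0},
    {"education": "high", "income": 80.0},
    {"education": "high", "income": 100.0},
    {"education": "high", "income": None},
]
valid = [
    {"education": "low", "income": None},
    {"education": "high", "income": None},
]

# Unconditional: one overall mean for everyone.
overall = mean(r["income"] for r in train if r["income"] is not None)

# Conditional: a separate mean for each education segment.
means = segment_means(train, "education", "income")
valid_imputed = impute_by_segment(valid, means, "education", "income")

print(overall, means, valid_imputed)
```

With these made-up numbers, the unconditional mean would assign the same income to both validation rows, while the conditional means give the low-education row a lower imputed income than the high-education row.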
