I have written a paper, "Missing Value Imputation", that I presented at the October Southeast SAS Users' Group 2021 meeting. It contains a historical summary of attempts to perform imputation of missing values, a detailed description of the mechanisms of missingness, and an application of the fuzzy c-means algorithm (FCM) to Census data to perform missing value imputation. I would like to share it with the SAS community. There are two macros, %FCM and %FCM_IMPUTE, which are included as attachments to this post. If you cannot access them, contact me and I will send them to you.
Here is the paper, which I have copied into this post for those of you who cannot access the PDF document of the original paper.
Unfortunately, I cannot paste the entirety of the document into this post. Please contact me directly and I will send you the PDF of the paper and the macros.
Thanks for this post. I look forward to reading it. This seems like a good candidate for an article for the SAS Communities Library. The Library often hosts papers and the associated code for SAS-related content. You might want to look into it.
I took a quick look at the code, and I'd like to suggest that you do not need either of the user-defined functions in your PROC IML step.
Again, thanks for posting.
Thanks for this post. I look forward to reading it. This seems like a good candidate for an article for the SAS Communities Library. The Library often hosts papers and the associated code for SAS-related content. You might want to look into it.
I took a quick look at the code, and I'd like to suggest that you do not need either of the user-defined functions in your PROC IML step.
Again, thanks for posting.
Hello, thanks for sharing your work to us! I have some simple yet possibly unanswered questions to raise concerning multiple imputation (MI). In one word, that is: How should we pool the point estimands of the respective imputation sample when the estimands do not (necessarily) follow a normal distribution?
One of the most common problems of this kind encountered in data analyses is the pooling of median: suppose I have a variable with missing data and does not follow a normal distribution, then median is the correct statistic for describing the central tendency of the variable. Now that I have computed M times and therefore have M medians, how should I pool them given the fact that while the mean follows an asymptotic normal distribution while this is not necessarily the case for median and that Rubin's rule of pooling the estimands is based on asymptotic normality?
Variable transform is a possible choice. That is, we bypass this problem by transforming the variable into another one that follows a normal distribution via methods like Box-Cox transformation and reported the pooled mean and standard deviation of the transformed rather than the original variable. But given the complexity of Box-Cox transformation to practitioners without professional statistical training (e.g., medical doctors) and the loss of "intuitiveness" and "explanability" (e.g., it is difficult for a person to tell what the mean and standard deviation of a Box-Cox transformed triglyceride are all about, especially for medical doctors), this method does not seem to work.
Similar situations frequently occur in medical statistics. Sensitivity, specificity and Youden index are all examples of the problems of this kind.
So, when pooling the estimands seem to violate the rationale of Rubin's rule, what should we do?
Many thanks!
Thank you very much for your patient reply! I can feel the passion you conveyed to a stranger who is interested in your research but had not uttered a word to you ever before. Thank you for your kind and passion!
I do agree with your opinion that the interest of audience is of great importance on how you present your research results. Prior to raising my question here, I browsed extensively over the literatures in multiple imputation (MI) and thought about that problem on my own. The only way I could thought when clarity and simplicity was taken into consideration was to use the median of estimates of the statistics to be pooled as the estimate of the pooled statistics. But my notion was not supported by previous research, so I raised my question here to see if better methods existed.
Here are some more advanced topics concerning MI. Are there any regression diagnostic methods in the presence of MI? The word "regression" in the phrase "regression diagnostic" here stands for the model to be built with the imputed datasets rather than the imputation models. More specifically, I would like to ask about the ways of assessing outliers, collinearity and strong influential observations in the presence of MI. The multiple samples created by MI has enabled a more robust point estimate of the estimands, but it also causes confusions to data analyst as to which sample to choose when it comes to compute regression diagnostic statistics.
Thank you again for your kind help!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.