

Posted 01-18-2022 07:07 PM
(1833 views)

I have written a paper, "Missing Value Imputation", that I presented at the Southeast SAS Users' Group meeting in October 2021. It contains a historical summary of attempts to perform imputation of missing values, a detailed description of the mechanisms of missingness, and an application of the fuzzy c-means (FCM) algorithm to Census data to perform missing value imputation. I would like to share it with the SAS community. There are two macros, %FCM and %FCM_IMPUTE, which are included as attachments to this post. If you cannot access them, contact me and I will send them to you.

Here is the paper, which I have copied into this post for those of you who cannot access the PDF document of the original paper.

Unfortunately, I cannot paste the entirety of the document into this post. Please contact me directly and I will send you the PDF of the paper and the macros.

1 ACCEPTED SOLUTION


Thanks for this post. I look forward to reading it. This seems like a good candidate for an article for the SAS Communities Library. The Library often hosts papers and the associated code for SAS-related content. You might want to look into it.

I took a quick look at the code, and I'd like to suggest that you do not need either of the user-defined functions in your PROC IML step.

- The DIST_FNC module can be replaced by the DISTANCE function in IML. (This capability of the DISTANCE function was released as part of SAS 9.4M5.) For background and a different "manual" implementation, see "Distances between observations in two groups."
- The MATVEC_SS module can be replaced by the one-liner **SSQ(X - c)**, where c is the row vector.

Again, thanks for posting.
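As a rough illustration of why those two modules are unnecessary, here is a small NumPy analogue (Python rather than SAS/IML so it runs anywhere; the array names and values are illustrative, and these are NumPy equivalents of what DISTANCE and SSQ compute, not the IML functions themselves):

```python
import numpy as np

# A small group of observations (rows = points, columns = coordinates).
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = np.array([2.0, 3.0])  # a single "center" as a row vector

# Analogue of IML's DISTANCE(X, c): Euclidean distance from each row of X to c.
dists = np.sqrt(((X - c) ** 2).sum(axis=1))

# Analogue of the one-liner SSQ(X - c): total sum of squared deviations.
ssq = ((X - c) ** 2).sum()

# The two quantities are related: SSQ equals the sum of the squared distances.
assert np.isclose(ssq, (dists ** 2).sum())
```

The same relationship is why a hand-rolled sum-of-squares module adds nothing: broadcasting `X - c` against each row and squaring already does all the work.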

6 REPLIES



Hello, thanks for sharing your work with us! I have some simple yet possibly unanswered questions concerning multiple imputation (MI). In a word: **How should we pool the point estimates from the respective imputed samples when the estimator does not (necessarily) follow a normal distribution?**

One of the most common problems of this kind in data analysis is pooling the median: suppose I have a variable with missing data that does not follow a normal distribution; the median is then the appropriate statistic for describing its central tendency. Now that I have imputed *M* times and therefore have *M* medians, how should I pool them, given that the **mean** follows an asymptotic normal distribution while this is not necessarily the case for the **median**, and that Rubin's rules for pooling estimates are based on asymptotic normality?

Variable transformation is a possible choice. That is, we bypass the problem by transforming the variable into one that follows a normal distribution, via methods like the Box-Cox transformation, and report the pooled mean and standard deviation of the transformed rather than the original variable. But given the complexity of the Box-Cox transformation for practitioners without formal statistical training (e.g., medical doctors) and the loss of intuitiveness and explainability (e.g., it is difficult to tell what the mean and standard deviation of a Box-Cox-transformed triglyceride level are all about), this method does not seem to work.

Similar situations frequently occur in medical statistics. Sensitivity, specificity, and the Youden index are all examples of problems of this kind.

So, when pooling the estimates seems to violate the rationale of Rubin's rules, what should we do?

Many thanks!
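For readers who want the precise statement behind "Rubin's rules are based on asymptotic normality", the standard combining rules (standard notation, not taken from the paper) for $M$ imputed estimates $\hat{Q}_m$ with within-imputation variances $U_m$ are:

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m - \bar{Q}\right)^2,
\qquad
T = \bar{U} + \left(1 + \frac{1}{M}\right)B
```

Inference then treats $(Q - \bar{Q})/\sqrt{T}$ as approximately $t$-distributed, which is the normality assumption the question notes may be questionable for the median.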


Thank you for your interest in my work.

As a general rule, whenever you impute missing values, you are adding "value" to the data. This "value" may be helpful in that it solves the problem of missingness, but it may be a hindrance when, as you have described with the transformations, the result is to add complications to understanding the results. If you, as a trained and literate practitioner of statistics, have doubts and uncertainties about the imputation process, how much more so someone who may be an expert in a specialized field but not in statistical reasoning and practice.

- I once tried to explain the concept of compound average growth rate to a manager, and he shook his head to indicate that he didn't understand what I had told him. So I simplified the answer to "it's like compound interest" and he nodded.
- I once tried to explain the concept of area under the ROC curve to my manager. He barely understood the concept of a 2x2 classification table, but once I said "and you build a new table for different values of cut-off value p and then plot true positive vs false positive as a parametric graph", he shook his head and I stopped trying to explain the AUROC idea because he was innumerate. He had a degree in ChemE but he was a yutz (Google on it).

Basically, the answer must be on the same level of sophistication as your audience, or they will at best ignore your words and at worst say that you can't communicate.

So, for all of these words that you have endured reading until now: I would compute the median of the medians and report that statistic to the users. Hopefully, your results will not be too far off from the "true" but unknown value of the missing variable(s). Your results will not be theoretically elegant, but your audience will immediately understand what you have done.

As a test, you might artificially set some values in complete data to missing, run the MI procedure to generate medians, and compare the estimated medians on the "pseudomissing" data to the medians on the complete-case data as a sanity check. Be sure to use the same percentage of induced missingness in the simulation as already exists in the original data to make the results realistic.

While this "seat of the pants" method is not particularly elegant, you are not writing a PhD dissertation. I seriously doubt that any prospective users will say "Seriously? Is this the best that you can do?" They will be more likely to say, "Thank you for providing us a solution that we can understand. Now we can go forward." Optimize when you can, satisfice when you cannot. Sometimes, good enough is enough.

Best regards,

Ross
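The suggested pseudomissing sanity check can be sketched in Python. This is a minimal illustration, not PROC MI: the hot-deck draw (resampling observed values) stands in for a real imputation engine, and the missingness rate, seed, and distribution are all arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)

# Complete "ground truth" data (skewed, so the median is the natural summary).
complete = rng.lognormal(mean=1.0, sigma=0.5, size=500)
true_median = np.median(complete)

# Induce pseudomissingness at a chosen rate (match the rate in the real data).
miss_rate = 0.2
mask = rng.random(complete.size) < miss_rate
observed = complete[~mask]

# Stand-in for a real MI procedure: M imputations, each filling the missing
# slots with random draws from the observed values (hot-deck style).
M = 20
medians = []
for _ in range(M):
    filled = complete.copy()
    filled[mask] = rng.choice(observed, size=mask.sum(), replace=True)
    medians.append(np.median(filled))

# "Median of medians" pooled estimate, compared against the complete-data median.
pooled = np.median(medians)
print(f"complete-data median: {true_median:.3f}, pooled: {pooled:.3f}")
```

If the pooled median drifts far from the complete-data median at the realistic missingness rate, that is a warning sign before trusting the procedure on the real data.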



Thank you very much for your patient reply! I can feel the passion you conveyed to a stranger who is interested in your research but had never spoken to you before. Thank you for your kindness and passion!

I do agree that the interests of the audience matter greatly in how research results are presented. Before raising my question here, I browsed the multiple imputation (MI) literature extensively and thought about the problem on my own. The only approach I could think of, with clarity and simplicity in mind, was to use the median of the estimates as the pooled statistic. But my notion was not supported by previous research, so I raised my question here to see if better methods exist.

Here are some more advanced topics concerning MI. **Are there any regression diagnostic methods in the presence of MI? The word "regression" in "regression diagnostics" here refers to the model built on the imputed datasets, not the imputation models. More specifically, how should one assess outliers, collinearity, and strongly influential observations in the presence of MI?** The multiple samples created by MI enable a more robust point estimate, but they also leave the data analyst unsure which sample to choose when computing regression diagnostic statistics.

Thank you again for your kind help!


Thank you for your kind words, Season. Now that I am retired, I have the luxury to investigate topics of interest in detail, and I can then describe my results for the user community to enjoy.

I have no real experience with MI. I looked through the SAS description of PROC MI and saw that the MI algorithm is a very intricate procedure with which to perform missing value imputation.

I wrote a paper on fuzzy c-means imputation which I posted to the SAS Community: https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/Missing-Value-Imputation/m-p/790785. It deals with missing value imputation quite effectively and, IMHO, much more simply than the MI procedure. It is a popular tool and there is much support for it in the published literature.

HTH,

Ross



All right. Thank you for your attention and for the time you spent helping me!
