Posted 01-18-2022 07:07 PM
(1943 views)

I have written a paper, "Missing Value Imputation", that I presented at the October Southeast SAS Users' Group 2021 meeting. It contains a historical summary of attempts to perform imputation of missing values, a detailed description of the mechanisms of missingness, and an application of the fuzzy c-means algorithm (FCM) to Census data to perform missing value imputation. I would like to share it with the SAS community. There are two macros, %FCM and %FCM_IMPUTE, which are included as attachments to this post. If you cannot access them, contact me and I will send them to you.

Here is the paper, which I have copied into this post for those of you who cannot access the PDF document of the original paper.

Unfortunately, I cannot paste the entirety of the document into this post. Please contact me directly and I will send you the PDF of the paper and the macros.

1 ACCEPTED SOLUTION

Thanks for this post. I look forward to reading it. This seems like a good candidate for an article for the SAS Communities Library. The Library often hosts papers and the associated code for SAS-related content. You might want to look into it.

I took a quick look at the code, and I'd like to suggest that you do not need either of the user-defined functions in your PROC IML step.

- The DIST_FNC module can be replaced by the DISTANCE function in IML. (This capability of the DISTANCE function was released as part of SAS 9.4M5.) For background and a different "manual" implementation, see "Distances between observations in two groups."
- The MATVEC_SS module can be replaced by the one-liner **SSQ(X - c)**, where c is the row vector.
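
For readers who want to see what these two replacements compute, here is a minimal NumPy sketch. It is an illustration only, not SAS/IML code and not the code from the %FCM macros. The function names `pairwise_distance` and `sum_of_squares` are hypothetical, and it assumes that DIST_FNC returns Euclidean distances between the rows of two matrices and that SSQ(X - c) is meant as the total sum of squared deviations of X from the row vector c.

```python
import numpy as np

def pairwise_distance(X, C):
    """Euclidean distance from every row of X (n x p) to every row of C (k x p)."""
    # Broadcasting builds an (n, k, p) array of coordinate differences.
    diff = X[:, None, :] - C[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def sum_of_squares(X, c):
    """Total sum of squared deviations of the rows of X from the row vector c."""
    # c is broadcast across the rows of X, then all squared entries are summed.
    return float(((X - c) ** 2).sum())

X = np.array([[0.0, 0.0], [3.0, 4.0]])   # two observations
C = np.array([[0.0, 0.0], [6.0, 8.0]])   # two cluster centers
D = pairwise_distance(X, C)              # D[i, j] = distance from X[i] to C[j]
ss = sum_of_squares(X, np.array([3.0, 4.0]))
```

In this small example each row of `D` holds the distances from one observation to both centers, which is the quantity a fuzzy c-means update needs.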

Again, thanks for posting.

6 REPLIES 6

Hello, thanks for sharing your work with us! I have some simple yet possibly unanswered questions about multiple imputation (MI). In short: **How should we pool the point estimates from the individual imputed samples when those estimates do not (necessarily) follow a normal distribution**?

One of the most common problems of this kind in data analysis is pooling medians. Suppose I have a variable with missing data that does not follow a normal distribution, so the median is the appropriate statistic for describing its central tendency. Having imputed *M* times, I now have *M* medians. How should I pool them, given that Rubin's rules for pooling estimates rely on asymptotic normality, which holds for the **mean** but not necessarily for the **median**?
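
For context, Rubin's rules as they are usually stated can be sketched in a few lines. The pooled estimate is the average of the *M* per-imputation estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name `rubin_pool` and the numbers below are made up for illustration; this is a sketch, not the output of PROC MIANALYZE.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool M point estimates and their variances with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)   # one estimate per imputed dataset
    u = np.asarray(variances, dtype=float)   # squared standard error per dataset
    m = len(q)
    q_bar = q.mean()                 # pooled point estimate
    within = u.mean()                # average within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return float(q_bar), float(total)

# Hypothetical estimates of a mean from M = 3 imputed datasets.
pooled, total_var = rubin_pool([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
```

The normality assumption enters when the pooled estimate and total variance are turned into a confidence interval or test, which is why a statistic whose sampling distribution is far from normal is a concern.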

Transforming the variable is one possible choice. That is, we bypass the problem by transforming the variable into one that follows a normal distribution, for example with a Box-Cox transformation, and report the pooled mean and standard deviation of the transformed variable rather than the original one. But the Box-Cox transformation is complex for practitioners without professional statistical training (e.g., medical doctors), and the result loses its intuitiveness and explainability (e.g., it is hard to say what the mean and standard deviation of a Box-Cox-transformed triglyceride level actually represent), so this method does not seem workable.

Similar situations occur frequently in medical statistics. Sensitivity, specificity, and the Youden index are all examples of this kind of problem.

So, when pooling the estimates seems to violate the rationale of Rubin's rules, what should we do?

Many thanks!

Thank you for your interest in my work.

As a general rule, whenever you impute missing values, you are adding "value" to the data. This "value" may be helpful in that it solves the problem of missingness, but it may be a hindrance when, as with the transformations you have described, the result is to complicate the understanding of the results. If you, as a trained and literate practitioner of statistics, have doubts and uncertainties about the imputation process, how much more so will someone who may be an expert in a specialized field but not in statistical reasoning and practice.

- I once tried to explain the concept of compound average growth rate to a manager and he shook his head to indicate that he didn't understand what I had told him. So I simplified the answer to "it's like compound interest" and he nodded.
- I once tried to explain the concept of area under the ROC curve to my manager. He barely understood the concept of a 2x2 classification table, but once I said "and you build a new table for different values of cut-off value p and then plot true positive vs false positive as a parametric graph", he shook his head and I stopped trying to explain the AUROC idea because he was innumerate. He had a degree in ChemE but he was a yutz (Google on it).

Basically, the answer must be on the same level of sophistication as your audience or they will at best ignore your words and at worst say that you can't communicate.

So, for all of these words that you have endured reading until now, I would compute the median of the medians and report that statistic to the users. Hopefully, your results will not be too far off from the "true" but unknown value of the missing variable(s). Your results will not be theoretically elegant, but your audience will immediately understand what you have done.
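
A minimal sketch of the median-of-medians idea, assuming the *M* imputed datasets are already available as arrays. The function name `pool_medians` and the tiny datasets are hypothetical. Note that this produces a point estimate only; it does not give a pooled standard error.

```python
import numpy as np

def pool_medians(imputed_datasets):
    """Return the median of the per-imputation medians, plus the medians themselves."""
    medians = [float(np.median(d)) for d in imputed_datasets]
    return float(np.median(medians)), medians

# Three hypothetical imputed versions of the same variable (M = 3).
imputed = [
    np.array([1.0, 2.0, 9.0]),
    np.array([1.0, 3.0, 9.0]),
    np.array([1.0, 5.0, 9.0]),
]
pooled, per_imputation = pool_medians(imputed)
```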

As a test, you might artificially set some values in complete data to missing, run the MI procedure to generate medians, and compare the estimated medians on the "pseudomissing" data to the medians on the complete case data as a sanity check. Be sure to use the same percentage of induced missingness in the simulation of missingness as already exists in the original data to make the results realistic.
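
The sanity check described above could be sketched as follows. Everything here is an assumption for illustration: the data are simulated from a lognormal distribution, the 20% missingness rate is arbitrary, values are deleted completely at random, and `hot_deck_impute` (random draws from the observed values) is a simple stand-in for whatever imputation method is actually used, not PROC MI or the FCM macros.

```python
import numpy as np

def induce_missing(x, rate, rng):
    """Set a fraction `rate` of the values to NaN, chosen completely at random."""
    x = x.astype(float).copy()
    n_missing = int(round(rate * x.size))
    idx = rng.choice(x.size, size=n_missing, replace=False)
    x[idx] = np.nan
    return x

def hot_deck_impute(x, rng):
    """Stand-in imputer: fill each NaN with a random draw from the observed values."""
    out = x.copy()
    mask = np.isnan(out)
    out[mask] = rng.choice(out[~mask], size=int(mask.sum()), replace=True)
    return out

rng = np.random.default_rng(2022)
complete = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # skewed "complete" data
with_holes = induce_missing(complete, rate=0.20, rng=rng) # match the real missing rate

# Impute M = 25 times, take the median each time, then pool by the median of medians.
medians = [float(np.median(hot_deck_impute(with_holes, rng))) for _ in range(25)]
pooled = float(np.median(medians))

true_median = float(np.median(complete))
abs_error = abs(pooled - true_median)
print(f"complete-data median: {true_median:.3f}, pooled median: {pooled:.3f}")
```

Repeating the whole exercise over many random deletions would give a better picture of how far the pooled median tends to fall from the complete-data median.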

While this "seat of the pants" method is not particularly elegant, you are not writing a PhD dissertation. I seriously doubt that any prospective users will say "Seriously? Is this the best that you can do?" They will be more likely to say, "Thank you for providing us a solution that we can understand. Now we can go forward." Optimize when you can, satisfice when you cannot. Sometimes, good enough is enough.

Best regards,

Ross

Thank you very much for your patient reply! I can feel the enthusiasm you showed to a stranger who is interested in your research but had never spoken to you before. Thank you for your kindness!

I agree that the audience's needs matter a great deal in how you present your research results. Before raising my question here, I read extensively in the multiple imputation (MI) literature and thought about the problem on my own. The only approach I could think of that kept things clear and simple was to use the median of the per-imputation estimates as the pooled estimate. But I could not find support for this idea in previous research, so I raised my question here to see whether better methods exist.

Here is a more advanced topic concerning MI. **Are there any regression diagnostic methods in the presence of MI? The word "regression" in "regression diagnostic" here refers to the model to be built with the imputed datasets rather than the imputation models. More specifically, I would like to ask how to assess outliers, collinearity, and highly influential observations in the presence of MI.** The multiple samples created by MI give a more robust point estimate, but they also leave the analyst unsure which sample to use when computing regression diagnostic statistics.

Thank you again for your kind help!

Thank you for your kind words, Season. Now that I am retired, I have the luxury to investigate topics of interest in detail and I can then describe my results for the user community to enjoy.

I have no real experience with MI. I looked through the SAS description of PROC MI and saw that the MI algorithm is a very intricate procedure with which to perform missing value imputation.

I wrote a paper on fuzzy c-means imputation which I posted to the SAS Community:

https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/Missing-Value-Imputation/m-p/790785.

It deals with missing value imputation quite effectively and, IMHO, much more simply than the MI procedure. It is a popular tool and there is much support for it in the published literature.

HTH,

Ross

All right. Thank you for your attention and for the time you spent on my questions!
