I am using PROC MI (multiple imputation) with the FCS regression method to impute missing values in continuous variables such as weight, height, and age. However, I noticed that while most of the imputed values looked reasonable and formed a nice distribution, a few of them came out negative.
After searching online, I found some posts on R and Python blogs where people talked about setting 0 as the lower bound for variables like age, weight, and height that can't plausibly take negative values. I was planning to add a lower bound, but I couldn't find an option in the PROC MI procedure that allows lower bounds to be set.
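For context, here is a minimal sketch of the kind of call I'm running (the data set name, seed, and number of imputations are placeholders, not my actual settings):

```sas
/* FCS regression imputation for continuous variables.
   Data set name, seed, and NIMPUTE value are placeholders. */
proc mi data=mydata out=mi_out nimpute=20 seed=12345;
   fcs reg;                  /* regression method for the listed variables */
   var age weight height;    /* variables in the imputation model */
run;
```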
Some people using other programming languages also tried tactics like:
Set negative values to 0: if age < 0 then age = 0;
Flip the sign using the absolute value, so that -5 becomes 5: age = abs(age);
Personally, I would have thought both might skew the distribution a bit?
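In SAS terms, those two fixes would look something like this (a sketch only, applied to the imputed data set after PROC MI; mi_out is a placeholder name). Both concentrate probability mass at or near zero, which is why I suspect they distort the left tail:

```sas
/* Post-hoc fixes applied after imputation - both alter the left tail */
data fixed;
   set mi_out;               /* placeholder: output data set from PROC MI */
   if age < 0 then age = 0;  /* truncation: stacks all negatives at 0 */
   /* or alternatively:
   age = abs(age);              reflection: folds the negative tail back */
run;
```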
While I was searching for answers on the internet, I found this comment in a statistics textbook, which suggests you just leave the implausible values alone:
"Intuitively speaking, it makes sense to round values or incorporate bounds to give plausible values. However, these methods have been shown to decrease efficiency and increase bias by altering the correlations or covariances between variables estimated during the imputation process. Additionally, these changes will often result in an underestimation of the uncertainty around imputed values. Remember, imputed values are NOT equivalent to observed values and serve only to help estimate the covariances between variables needed for inference (Johnson and Young 2011)."
Does anyone else have any thoughts about lower bounds, and if you've used these before in such situations, did it work out well? I couldn't find any similar posts on this website, but I'm sure it will be useful for others using imputation methods to read this and learn about how other people handled this. Thanks for your thoughts.
Hi @Buzzy_Bee,
I remember that a post by ballardw last week mentioned the MINIMUM= option of the PROC MI statement to specify a lower bound for imputed values: https://communities.sas.com/t5/Statistical-Procedures/Proc-MI-impute-only-some-missing-values-and-se.... But I don't have any experience with this procedure.
Thanks very much - I was expecting it to be called Lower Bound or something like that.
I tried the MINIMUM=0 option that you suggested and it works well; it has produced a sensible-looking normal distribution. It looks like there is a MAXIMUM= option as well, but PROC MI kept the imputed values at or below the largest observed value on its own, so I didn't need to specify an upper bound.
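For anyone who finds this thread later, here is roughly what the bounded call looks like (names and settings are placeholders; per the documentation, the MINIMUM= values correspond positionally to the variables on the VAR statement):

```sas
/* Same FCS imputation with a lower bound of 0 on each variable.
   The three MINIMUM= values line up with age, weight, height. */
proc mi data=mydata out=mi_out nimpute=20 seed=12345
        minimum=0 0 0;
   fcs reg;
   var age weight height;
run;
```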
I am a big believer in what Johnson and Young propose. The amount of bias introduced into means is relatively small, and it retains the best estimates of variances/covariances. Truncating the values at a minimum cut-off will definitely decrease the estimates for the variance, resulting in poor Type I error performance.
SteveDenham
Thanks for your suggestion. Their theory certainly aligns more with how a statistician would think.
Leaving the distribution alone (no minimum bound), PROC MI keeps the measures of central tendency in line with the non-imputed variable in my data set. If I set the lower bound to 0, the mean comes out slightly above the original mean, so I can see why Johnson and Young suggest bounds are not the best idea.
One other approach might be to rescale the values, adding a constant to all of them after imputation and thereby eliminating the negative values (sort of like an offset). This would probably be acceptable if you are using something like PROC GLM to analyze the imputed data sets. However, a non-linear analysis (GENMOD, GEE, GLIMMIX) may end up with results that are biased, with the bias depending on the size of the offset.
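As a rough sketch of what I mean (the data set names and the offset of 10 are placeholders, not recommendations):

```sas
/* Shift all values by a constant after imputation so none are negative.
   The offset must be accounted for when interpreting results. */
data mi_shifted;
   set mi_out;       /* placeholder: imputed data set from PROC MI */
   age = age + 10;   /* placeholder offset large enough to clear negatives */
run;
```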
SteveDenham
That's a good idea too. I hadn't thought of that.
If possible, could you share the reference for Johnson and Young? Thank you.
I apologize, but I don't recall the source now. Google Scholar should help you find it.
SteveDenham
This usage note should also be helpful in regards to your question. It contains the full citation information for the Johnson and Young reference that @SteveDenham mentioned as well.