I am using PROC MI (multiple imputation) with the FCS regression method to impute missing values in continuous variables such as weight, height, and age. However, I noticed that while most of the imputed values looked reasonable and formed a nice distribution, a few of them came out negative.
After searching online, I found some posts on R and Python blogs where people talked about setting 0 as the lower bound for variables like age, weight, and height that can't plausibly take negative values. I was planning to add a lower bound, but I couldn't find an option in the PROC MI procedure that allows lower bounds to be set.
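For context, here is a minimal sketch of the kind of call I'm running (the data set name, seed, and number of imputations are placeholders, not my actual settings):

```sas
/* FCS regression imputation for continuous variables.
   Data set name, seed, and NIMPUTE value are placeholders. */
proc mi data=mydata out=mi_out nimpute=20 seed=12345;
   fcs reg;                  /* regression method for the listed variables */
   var age weight height;    /* variables in the imputation model */
run;
```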
Some people using other programming languages also tried tactics like:
Set negative values to 0: if age < 0 then age = 0;
Flip the sign using the absolute value, so that -5 becomes 5: age = abs(age);
Personally, I would have thought both might skew the distribution a bit?
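In SAS terms, those two fixes would look something like this (a sketch only, applied to the imputed data set after PROC MI; mi_out is a placeholder name). Both concentrate probability mass at or near zero, which is why I suspect they distort the left tail:

```sas
/* Post-hoc fixes applied after imputation - both alter the left tail */
data fixed;
   set mi_out;               /* placeholder: output data set from PROC MI */
   if age < 0 then age = 0;  /* truncation: stacks all negatives at 0 */
   /* or alternatively:
   age = abs(age);              reflection: folds the negative tail back */
run;
```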
While I was searching for answers on the internet, I found this comment in a statistics textbook, which suggests you just leave the implausible values alone:
"Intuitively speaking, it makes sense to round values or incorporate bounds to give plausible values. However, these methods have been shown to decrease efficiency and increase bias by altering the correlations or covariances between variables estimated during the imputation process. Additionally, these changes will often result in an underestimation of the uncertainty around imputed values. Remember, imputed values are NOT equivalent to observed values and serve only to help estimate the covariances between variables needed for inference (Johnson and Young 2011)."
Does anyone else have any thoughts about lower bounds, and if you've used these before in such situations, did it work out well? I couldn't find any similar posts on this website, but I'm sure it will be useful for others using imputation methods to read this and learn about how other people handled this. Thanks for your thoughts.
Hi @Buzzy_Bee,
I remember that a post by ballardw last week mentioned the MINIMUM= option of the PROC MI statement to specify a lower bound for imputed values: https://communities.sas.com/t5/Statistical-Procedures/Proc-MI-impute-only-some-missing-values-and-se.... But I don't have any experience with this procedure.
Thanks very much - I was expecting it to be called Lower Bound or something like that.
I tried the MINIMUM=0 option that you suggested and it works well; it has produced a sensible-looking normal distribution. It looks like there is a MAXIMUM= option as well, but PROC MI kept the imputed values at or below the largest observed value on its own, so I didn't need to specify an upper bound.
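For anyone who finds this thread later, here is roughly what the bounded call looks like (names and settings are placeholders; per the documentation, the MINIMUM= values correspond positionally to the variables on the VAR statement):

```sas
/* Same FCS imputation with a lower bound of 0 on each variable.
   The three MINIMUM= values line up with age, weight, height. */
proc mi data=mydata out=mi_out nimpute=20 seed=12345
        minimum=0 0 0;
   fcs reg;
   var age weight height;
run;
```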
I am a big believer in what Johnson and Young propose. The amount of bias introduced into means is relatively small, and it retains the best estimates of variances/covariances. Truncating the values at a minimum cut-off will definitely decrease the estimates for the variance, resulting in poor Type I error performance.
SteveDenham
Thanks for your suggestion. Their theory certainly aligns more with how a statistician would think.
Leaving the distribution alone (no minimum bound), PROC MI keeps the measures of central tendency in line with the non-imputed variable in my data set. If I set the lower bound to 0, the mean comes out slightly above the original mean, so I can see why Johnson and Young suggest bounds are not the best idea.
One other approach might be to rescale the values, adding a constant to all of them after imputation and thereby eliminating the negative values (sort of like an offset). This would probably be acceptable if you are using something like PROC GLM to analyze the imputed data sets. However, a non-linear analysis (GENMOD, GEE, GLIMMIX) may end up with results that are biased, with the bias depending on the size of the offset.
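As a rough sketch of what I mean (the data set names and the offset of 10 are placeholders, not recommendations):

```sas
/* Shift all values by a constant after imputation so none are negative.
   The offset must be accounted for when interpreting results. */
data mi_shifted;
   set mi_out;       /* placeholder: imputed data set from PROC MI */
   age = age + 10;   /* placeholder offset large enough to clear negatives */
run;
```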
SteveDenham
That's a good idea too. I hadn't thought of that.
If possible, could you share the reference for Johnson and Young? Thank you.
I apologize, but I don't recall the source now. Google Scholar should help you find it.
SteveDenham
This usage note should also be helpful in regards to your question. It contains the full citation information for the Johnson and Young reference that @SteveDenham mentioned as well.