topic Re: Data Imputation in SAS Data Science

Data Imputation

SlutskyFan — Tue, 01 Feb 2011 15:10:48 GMT

I just read the section on M-estimators in the book that SAS references, 'Understanding Robust and Exploratory Data Analysis' (Hoaglin, Mosteller, & Tukey), and have a good understanding of M-estimators. I understand their advantages over, say, mean imputation, but does anyone have any advice on when M-estimators would be better or worse than, say, tree imputation or distribution methods?

Are there any references on the pros and cons of each of these methods? It just seems more natural to me to use tree imputation rather than a point estimate like the mean, or even an M-estimator, regardless of its resistance or robustness of efficiency. Tree imputation just seems 'more informed' (although I guess, theoretically, a lot of information is captured by a mean or an M-estimator of location, in the sense that most of the observations should be centered around these points).
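
To make the mean vs. M-estimator comparison concrete, here is a minimal Python sketch (not SAS code, and not necessarily how any SAS procedure computes it) of a Huber M-estimate of location used as the fill-in value. The function names, the tuning constant c = 1.345, the MAD-based scale, and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, median

def huber_location(values, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted means.

    Missing values are represented by None and ignored.
    c is the usual Huber tuning constant (an assumed default here).
    """
    xs = [v for v in values if v is not None]
    mu = median(xs)                          # robust starting point
    mad = median([abs(x - mu) for x in xs])  # median absolute deviation
    scale = mad / 0.6745 if mad > 0 else 1.0 # normal-consistent scale, held fixed
    for _ in range(max_iter):
        # Points within c*scale get full weight; farther points are downweighted.
        weights = [
            1.0 if abs(x - mu) <= c * scale else c * scale / abs(x - mu)
            for x in xs
        ]
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

def impute(values, fill):
    """Replace every None with a single fill value (univariate imputation)."""
    return [fill if v is None else v for v in values]

# Hypothetical variable with one missing value and one gross outlier.
x = [9.8, 10.1, 10.0, 9.9, 10.2, None, 55.0]
observed = [v for v in x if v is not None]

mean_fill = mean(observed)       # dragged upward by the outlier
huber_fill = huber_location(x)   # stays near the bulk of the data

x_mean = impute(x, mean_fill)
x_huber = impute(x, huber_fill)
```

Both fills are still single point estimates that ignore the other variables, which is the sense in which tree imputation is 'more informed'; the M-estimator only protects the fill value from outliers in the one variable being imputed.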

Any thoughts?

Re: Data Imputation

topkatz — Fri, 04 Feb 2011 23:08:32 GMT

Hi.

I did a quick web search on comparison of imputation methods. There were a bunch of journal articles that I didn't want to buy, but the abstracts all said similar things -- imputing was better than not imputing, and multivariate methods outperformed univariate methods.

SAS software seems to be lagging the state of the art in imputation by about a decade -- I think their last serious improvement for imputation was when they added PROC MI to SAS/STAT about ten years ago (and that methodology had already been around for twenty years at that time). Enterprise Miner doesn't appear to offer expectation maximization for multiple imputation, but it has a few methods not available in STAT, notably tree imputation, as you mentioned.

I once read a pretty convincing endorsement of cluster imputation given by one of the eminent senior statisticians at SAS, Warren Sarle. I wish I could find it; I'd copy it here. Cluster imputation is kind of a compromise between univariate and multivariate methods. Finding the clusters is a multivariate technique, but once you have the clusters, you do a simple substitution of cluster means or medians for the missing values of observations within each cluster (I suppose you could use M-estimators within each cluster, if you wanted to). You can get cluster imputation in both SAS/STAT and Enterprise Miner, but you have to know where to look. In SAS/STAT it's in PROC FASTCLUS; in Enterprise Miner, it's in the Cluster node, not the Impute node.
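
The two steps described above (find clusters, then substitute a per-cluster summary) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not what PROC FASTCLUS or the Cluster node does internally: it uses a small k-means-style loop, measures distance over the observed coordinates only, seeds the clusters with the first k complete rows, and fills with cluster means. All names and data are hypothetical.

```python
from statistics import mean

def partial_sq_dist(row, center):
    """Squared distance over the coordinates observed in row,
    rescaled so rows with missing values stay comparable."""
    pairs = [(x, c) for x, c in zip(row, center) if x is not None]
    if not pairs:
        return 0.0
    return sum((x - c) ** 2 for x, c in pairs) * len(row) / len(pairs)

def cluster_impute(rows, k, n_iter=20):
    """Assign rows to k clusters, then fill each None with the mean of
    that variable within the row's cluster."""
    # Assumption: seed with the first k complete rows (deterministic).
    complete = [r for r in rows if None not in r]
    centers = [list(r) for r in complete[:k]]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Multivariate step: nearest center using the observed coordinates.
        labels = [
            min(range(k), key=lambda j: partial_sq_dist(r, centers[j]))
            for r in rows
        ]
        # Update each center coordinate from the observed values only.
        for j in range(k):
            for d in range(len(centers[j])):
                obs = [r[d] for r, lab in zip(rows, labels)
                       if lab == j and r[d] is not None]
                if obs:
                    centers[j][d] = mean(obs)
    # Univariate step: substitute the cluster mean for each missing value.
    filled = [
        [centers[lab][d] if x is None else x for d, x in enumerate(r)]
        for r, lab in zip(rows, labels)
    ]
    return filled, labels

# Hypothetical data: two well-separated groups, two rows with a missing value.
rows = [
    [1.0, 2.0],
    [9.0, 10.0],
    [1.2, 1.8],
    [8.8, 10.4],
    [0.8, None],
    [None, 9.6],
]
filled, labels = cluster_impute(rows, k=2)
```

Swapping `mean(obs)` for a median or an M-estimate of location in the final fill gives the within-cluster variants mentioned above.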

Re: Data Imputation

SlutskyFan — Sat, 12 Feb 2011 01:35:34 GMT

Thanks! That was very helpful.