09-01-2016 10:18 AM
I am trying to cluster a dataset of over 160,000 rows. Some of the variables have a large percentage of missing values (around 50%, some well above 50% or even 90%), but since the number of observations is relatively large, could imputation work, so that there is no need to reject those variables?
If you consider imputation an option, which imputation method would you use?
Thanks a lot!
P.S. I take it for granted that imputation is also needed in clustering; I usually hear about imputation as a pre-step in predictive modelling using trees/regression/neural networks.
09-08-2016 01:52 PM
10-05-2016 06:54 AM
Thank you very much for your detailed answer, and I am sorry for my late response.
I tried different imputation methods and got pretty much the same final clustering results. The final solution was a combination of imputing with a tree model and imputing with a constant value. As you said, "It is not really technique. It is business knowledge", so missing values could be guessed in most cases with good knowledge of the industry/business. In those cases, imputation with a constant value was used. In other cases, where no constant value was expected but missing values made no sense either, imputation with a tree model was used.
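The combination described above can be sketched roughly as follows. This is only an illustration in plain Python, not the actual workflow: the column names (`age`, `income`, `num_claims`), the toy data, and the helper functions are all hypothetical, and the "tree" is reduced to a single-split regression tree (a stump) to keep the example short. In practice a full tree from a statistics package would likely be used instead.

```python
from statistics import mean


def impute_constant(rows, column, value):
    """Return copies of rows with missing (None) entries in `column` set to `value`."""
    return [{**r, column: value} if r[column] is None else dict(r) for r in rows]


def fit_stump(rows, target, predictor):
    """Fit a depth-1 regression tree: one split on `predictor` minimising squared error."""
    known = [(r[predictor], r[target]) for r in rows
             if r[target] is not None and r[predictor] is not None]
    xs = sorted({x for x, _ in known})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct predictor values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in known if x <= threshold]
        right = [y for x, y in known if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct predictor values to split")
    return best[1], best[2], best[3]


def impute_with_stump(rows, target, predictor):
    """Fill missing `target` values with the stump's prediction from `predictor`."""
    threshold, left_mean, right_mean = fit_stump(rows, target, predictor)
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        if r[target] is None and r[predictor] is not None:
            r[target] = left_mean if r[predictor] <= threshold else right_mean
        filled.append(r)
    return filled


# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 25, "income": 30.0, "num_claims": None},
    {"age": 30, "income": 32.0, "num_claims": 1},
    {"age": 50, "income": 60.0, "num_claims": None},
    {"age": 55, "income": 62.0, "num_claims": 2},
    {"age": 28, "income": None, "num_claims": 0},
    {"age": 52, "income": None, "num_claims": None},
]

# Assumed business rule: a missing claim count means "no claims", so use the constant 0.
step1 = impute_constant(rows, "num_claims", 0)
# No sensible constant exists for income, so predict it from age with the stump.
filled = impute_with_stump(step1, "income", "age")
```

The design choice mirrors the split in the post: a constant goes where domain knowledge says what a missing value means, and a model-based fill goes where it does not. The completed `filled` rows would then be passed on to the clustering step.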
Thank you very much,