VanDalucas
Obsidian | Level 7

Hi,

 

I am trying to cluster a dataset of over 160,000 rows. Some of the variables have a high percentage of missing values (around 50%, some well above 50% or even 90%), but since the number of observations is relatively large, could imputation work, so that there is no need to reject those variables?

If you consider imputing an option, which imputing method would you consider?

 

Thanks a lot!

 

P.S. I take for granted that imputation is also needed in clustering; I usually hear about imputing as a pre-step in predictive modelling using trees/regression/neural networks.

2 REPLIES
JasonXin
SAS Employee
Hi,

The main purpose of imputing is to keep the observations, in other words to preserve the model universe. The price is distortion, and the question is how much of it you can accept. The least distortive approach is to find out the reason behind the missing values.

It is very rare that the model universe has all its data sourced from just 1 or 2 tables. It is almost always assembled from various sources, easily more than 10 tables. One top source of missing values is what I sometimes call 'left join syndrome'. The left side table is your master table of 160,000 IDs, but the right hand side table may only have 52% of the IDs that can match to the left hand side. So you have ~48% missing on all the variables you are appending from the right hand table. Now the nature of the right hand table is key to your imputation. It is not really technique. It is business knowledge. The question is whether a table missing 48% is useful overall at all. If the answer is YES, then you can dive into individual variables. It is good practice to keep a 'missing lineage' when you merge throughout the universe preparation process.

After you go through the business background, here are some rules of thumb. For categorical/nominal variables with >50% missing, I would drop them, regardless of whether this is clustering or a supervised model. If the variable has many categories and you group the missing portion with one of them, you have no grounds for promoting that non-missing category to dominate the variable. The artificial impact of this practice is more severe in clustering than in a supervised model. In decision tree modeling, for example, carrying missing values as they are can add value to the model with little distortion. Clustering does not have this sort of mechanism. If you assign a unique value to replace the missing portion, you create a dominating but artificial value. This is where it becomes tricky, depending on how your clustering solution parametrizes the categorical variable.
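To make the 'missing lineage' idea above concrete, here is a minimal sketch in plain Python (not SAS). The table contents, the source name `crm`, the column `income`, and the helper names are all hypothetical, made up for illustration. It records, for each row of the master table, whether a value is missing because the ID had no match in the right hand table or because the value was blank in the source.

```python
# Hypothetical data: a master table of IDs and one right hand source table.
master_ids = [101, 102, 103, 104, 105]
right_table = {
    101: {"income": 52000.0},
    103: {"income": 61000.0},
    104: {"income": None},  # ID matched, but the value is blank in the source
}

def left_join_with_lineage(ids, right, column, source_name):
    """Left-join one column and record why each value is missing."""
    rows = []
    for id_ in ids:
        match = right.get(id_)
        if match is None:
            value, lineage = None, f"no_match_in_{source_name}"
        elif match.get(column) is None:
            value, lineage = None, f"missing_in_{source_name}"
        else:
            value, lineage = match[column], "present"
        rows.append({"id": id_, column: value, f"{column}_lineage": lineage})
    return rows

def missing_rate(rows, column):
    """Fraction of rows where the column is missing after the join."""
    return sum(r[column] is None for r in rows) / len(rows)

joined = left_join_with_lineage(master_ids, right_table, "income", "crm")
rate = missing_rate(joined, "income")
for row in joined:
    print(row)
print("missing rate:", rate)
```

Keeping the lineage column next to the variable lets you separate 'left join' missingness from genuinely blank values when you later decide whether to drop or impute.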
If the categorical variable has >> 50% non-missing, I am comfortable grouping the missing portion with one of the non-missing groups. In SAS clustering, there is a random option that allows you to impute according to the distribution of the non-missing values. This is available for both categorical and interval variables.

As for interval variables, if you have many input variables to spend, you can afford to raise the bar on the required non-missing % and drop more variables. If you don't have many variables, you may have to tolerate variables that have many missing values. In playing with the required %, you need to closely consult the definition of each variable. Some 'important' variables with a large % of missing values may have to stay. In other words, you need to balance.

One recommendation is to try different imputation methods (mean, median, random) and assess their respective impact on your clustering solution. When there are many variables, you may consider variable clustering with different imputation methods and assess the impact accordingly. There is no hard and fast rule about which way is better. This is where packages like SAS EM provide a huge productivity edge, in that they document and compare results more efficiently than hand-written code.

One thing special about clustering is scale. It is necessary to scale all the input variables together for clustering. Whether you should impute before or after re-scaling/standardization is another layer of complexity. There are other aspects related to which distance measure you use in your clustering. I will leave that to another day.

Hope this helps. Thanks for using SAS.

Best Regards
Jason Xin
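As a rough illustration of comparing mean, median and random imputation before scaling, here is a small sketch in plain Python. It is not the SAS random option mentioned above; the `random` branch simply draws replacements from the observed values, which is an assumed stand-in for imputing according to the distribution of the non-missing values. The function names and the sample data are made up.

```python
import random
import statistics

def impute(values, method, seed=0):
    """Fill None entries using only the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fills = [statistics.mean(observed)] * len(values)
    elif method == "median":
        fills = [statistics.median(observed)] * len(values)
    elif method == "random":
        # Assumed stand-in: sample replacements from the observed values.
        rng = random.Random(seed)
        fills = [rng.choice(observed) for _ in values]
    else:
        raise ValueError(f"unknown method: {method}")
    return [f if v is None else v for v, f in zip(values, fills)]

def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical interval variable with 3 of 8 values missing.
raw = [1.0, None, 3.0, None, 5.0, 20.0, None, 2.0]
observed = [v for v in raw if v is not None]

by_method = {m: impute(raw, m) for m in ("mean", "median", "random")}
spread = {m: statistics.pstdev(vals) for m, vals in by_method.items()}
scaled = {m: standardize(vals) for m, vals in by_method.items()}

print("observed spread:", statistics.pstdev(observed))
for m in by_method:
    print(m, "spread after imputation:", spread[m])
```

Mean imputation always shrinks the spread of the variable relative to the observed values, so a variable imputed this way is rescaled differently when it is standardized afterwards. That is one reason the order of imputation and standardization matters for distance-based clustering.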
VanDalucas
Obsidian | Level 7

Mr.Xin,

 

Thank you very much for your detailed answer, and I am sorry for my late response.

 

I did try different imputing methods, with pretty much the same final clustering results. The final solution was a combination of imputing using a tree model and imputing using a constant value. As you said, "It is not really technique. It is business knowledge", so missing values could be guessed in most cases with good knowledge of the industry/business. In those cases, imputing with a constant value was used. In some other cases, where a constant value was not expected but missing values made no sense either, imputing using a tree model was used.
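For readers who want to see the two approaches side by side, here is a minimal sketch in plain Python. It is not the SAS tree imputation used above: the tree is reduced to a single split (a regression stump) on one predictor, and the column names, the data, and the business rule "missing `n_claims` means 0" are hypothetical.

```python
from statistics import mean

def impute_constant(rows, target, constant):
    """Replace missing values with a constant chosen from business knowledge."""
    return [dict(r, **{target: constant}) if r[target] is None else dict(r)
            for r in rows]

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold on x, one mean per leaf."""
    distinct = sorted(set(xs))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate split halfway between neighbours
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1], best[2], best[3]

def impute_with_stump(rows, target, predictor):
    """Fill missing target values with the leaf mean of a one-split tree."""
    known = [r for r in rows if r[target] is not None]
    t, lm, rm = fit_stump([r[predictor] for r in known],
                          [r[target] for r in known])
    return [dict(r, **{target: lm if r[predictor] <= t else rm})
            if r[target] is None else dict(r) for r in rows]

# Hypothetical customer rows.
rows = [
    {"tenure": 1, "spend": 10.0, "n_claims": None},
    {"tenure": 2, "spend": 12.0, "n_claims": 1},
    {"tenure": 8, "spend": 50.0, "n_claims": None},
    {"tenure": 9, "spend": 54.0, "n_claims": 2},
    {"tenure": 3, "spend": None, "n_claims": 0},
    {"tenure": 10, "spend": None, "n_claims": None},
]

filled = impute_constant(rows, "n_claims", 0)            # assumed business rule
filled = impute_with_stump(filled, "spend", "tenure")    # model-based fill
for row in filled:
    print(row)
```

A real tree imputation would use several predictors and deeper trees; the stump only shows the mechanism of predicting the missing value from other variables.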

 

Thank you very much,

Vangelis
