How do you compare different methods when performing cluster analysis in SAS?

N/A
Posts: 1

How do you compare different methods when performing cluster analysis in SAS? Is there a statistic that tells you how the model performs?

Frequent Contributor
Posts: 130

Re: How do you compare different methods when performing cluster analysis in SAS?

Unfortunately there is no single test statistic that will do that. I advise my students to use hierarchical cluster models to settle on a reasonable number of clusters but then use a non-hierarchical method to produce a better cluster solution for that given number of clusters. It is hard to know what the 'right' number of clusters is, but you can usually recognise a useful cluster solution when you profile clusters by other, non-basis, observed variables.
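A minimal sketch of the second stage of that workflow, written in plain Python only to show the idea (in SAS the refinement would be done by a procedure, not by hand). It assumes the hierarchical stage has already produced a label per observation for the chosen number of clusters, and it refines those labels with k-means style reassignment. The names `refine`, `points`, and `hier_labels` are hypothetical, and the data are made up for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroids_of(points, labels):
    """Mean of the member points for each cluster label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: tuple(sum(col) / len(members) for col in zip(*members))
            for lab, members in groups.items()}

def refine(points, initial_labels, max_iter=100):
    """Reassign each point to its nearest centroid until labels stop changing.

    Note: a cluster that loses all of its members simply disappears here;
    this sketch does not try to re-seed it.
    """
    labels = list(initial_labels)
    for _ in range(max_iter):
        cents = centroids_of(points, labels)
        new_labels = [min(cents, key=lambda lab: sq_dist(p, cents[lab]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Illustrative data: two obvious groups, with one point that the
# (assumed) hierarchical stage placed in the wrong cluster.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
hier_labels = [1, 1, 1, 1, 2, 2]
refined = refine(points, hier_labels)
```

The number of clusters stays fixed at whatever the hierarchical stage suggested; only the membership changes.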
Occasional Contributor
Posts: 5

Re: How do you compare different methods when performing cluster analysis in SAS?

As @Damien_Mather said, there's no easy solution. In fact, there are many strategies and methods to try. For example, you can run PROC CLUSTER on each of the distances available in PROC DISTANCE. If your dataset has a very large number of variables, you can first perform a factor analysis to reduce the number of columns, which makes things simpler and faster, especially in SAS Studio, which is meant for learning purposes and can't handle very big datasets. You can also try the different clustering methods. Once you "cross" the distances available in SAS with the different methods in PROC CLUSTER, the analysis grows quickly, because you have to evaluate each solution manually, and that is the painful part.
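To get a feel for how quickly the work multiplies, here is a small sketch in plain Python that just enumerates the candidate runs. The distance and method names are an illustrative subset of the keywords I believe PROC DISTANCE and PROC CLUSTER accept, and the range of cluster counts is an assumption; substitute whatever applies to your data.

```python
from itertools import product

# Illustrative subsets only (assumed, not exhaustive).
distances = ["EUCLID", "CITYBLOCK", "CHEBYCHEV"]
methods = ["AVERAGE", "COMPLETE", "SINGLE"]
cluster_counts = range(2, 7)  # candidate numbers of clusters: 2 to 6

def candidate_runs(distances, methods, cluster_counts):
    """Return every distance/method/cluster-count combination to evaluate."""
    return [
        {"distance": d, "method": m, "n_clusters": k}
        for d, m, k in product(distances, methods, cluster_counts)
    ]

runs = candidate_runs(distances, methods, cluster_counts)
print(len(runs), "solutions to evaluate by hand")
```

Even with three distances, three methods, and five cluster counts, every combination is a separate solution that has to be profiled.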

So first things first: look at your variables and see whether you can reduce them to a manageable set, i.e., by grouping them into factors. Then look for the distances and methods that apply to your data and run the cluster analysis using different strategies: as I said, PROC CLUSTER on its own, or PROC ACECLUS + PROC FASTCLUS + PROC CLUSTER. It all depends on the nature of your data and the purpose of your analysis. Then evaluate the results and settle on a final solution.

Now, why do things get hard? Because for each (yes, each) distance you test, even with just one strategy, you have to try different numbers of clusters. After that, you have to evaluate the number of observations in each cluster, each cluster's composition and its separation from the other clusters, and the variables that act as drivers, so that you can name the clusters meaningfully.
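Part of that evaluation can be summarised numerically. The sketch below, in plain Python and on made-up data, counts the observations per cluster and computes a pseudo F ratio (between-cluster variance over within-cluster variance), which is one common way to express separation. It is only an aid for ranking candidates, not a test that picks the "right" solution, and the function names are hypothetical.

```python
from collections import Counter

def cluster_sizes(labels):
    """Number of observations in each cluster."""
    return dict(Counter(labels))

def pseudo_f(points, labels):
    """Between-cluster mean square divided by within-cluster mean square."""
    n = len(points)
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    grand = [sum(col) / n for col in zip(*points)]
    between = 0.0
    within = 0.0
    for members in groups.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        between += len(members) * sum((c - g) ** 2 for c, g in zip(centroid, grand))
        within += sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members)
    if within == 0:
        return float("inf")  # every point sits exactly on its centroid
    return (between / (k - 1)) / (within / (n - k))

# Made-up data: two tight groups far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
well_separated = [1, 1, 1, 2, 2, 2]   # matches the visible groups
mixed_up = [1, 2, 1, 2, 1, 2]         # ignores the visible groups

print(cluster_sizes(well_separated), pseudo_f(points, well_separated))
print(cluster_sizes(mixed_up), pseudo_f(points, mixed_up))
```

Both labelings have the same cluster sizes, so the sizes alone cannot tell them apart; the separation measure can. Naming the clusters by their driver variables still has to be done by looking at the profiles.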

Then, with this information in hand, you either choose the final solution yourself, if you know the business the data come from well, or you present two or three candidate solutions to the people who have that knowledge. They will point to a solution and help you understand it better.

Hope this helps.
