Statistical Procedures

Programming the statistical procedures from SAS
HLuffy
Calcite | Level 5

I created n clusters using the SAS Miner HP Cluster node (k-means), but every time I try to replicate the same clusters I get different ones. Even using EG with different initializations gives me different clusters. My questions are:

1- Is there a way to fix these clusters and make my work replicable?

2- If I can't fix my clusters, is there a way to test their stability, for example using an overlap rate, and say that above 75% the clusters are stable?

3- I couldn't find any straightforward answer about the stability of clusters and how important it is. Can we speak about clustering stability in this situation? Is it very important to test stability before using the clusters? Which measures can do that? Are there any nodes in SAS Miner that can do that?

I'm a little bit lost with this question of stability. Thank you for your understanding!

9 REPLIES 9
Reeza
Super User

K-means clustering doesn't have a single unique solution; rather, there's a set of possible solutions, and it's about picking the one that makes the most sense for your use case. In particular, if you change the initialization parameters, the clusters will likely be different.
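
To illustrate the point about initialization, here is a minimal sketch in plain Python, not SAS. It assumes a basic Lloyd's k-means where the only source of randomness is the choice of initial centers; the function name `kmeans`, the `seed` parameter, and the toy data are all hypothetical and not taken from the HP Cluster node. With the same seed and the same row order, the run is repeatable; with a different seed, the labels may be permuted or the partition itself may change.

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed, n_iter=100):
    """Plain Lloyd's k-means; `seed` fixes the random choice of initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Toy data (an assumption for illustration): three Gaussian blobs in 2-D.
data_rng = random.Random(0)
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(50)]

run_a, _ = kmeans(points, k=3, seed=42)
run_b, _ = kmeans(points, k=3, seed=42)
run_c, _ = kmeans(points, k=3, seed=7)

print(run_a == run_b)  # True: same seed and same data order
print(run_a == run_c)  # not guaranteed either way: a different start can relabel or regroup
```

Whether and how the HP Cluster node exposes a seed is not confirmed in this thread, so check the node's initialization properties before relying on this idea there.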

If your clusters are unstable, it may mean they are not distinct enough, and you should consider reducing the number of clusters to get a more stable solution. How did you pick the number of clusters?

HLuffy
Calcite | Level 5
Thanks Reeza for the quick reply. I chose Global Peak Value as the estimation criterion and set the number of clusters between 2 and 10, and the result gave me 6 clusters. How can I know whether my clusters are stable? Is there any way to know the number of iterations needed to reach stability? By the way, I chose Euclidean distance as the similarity measure.
Reeza
Super User
I'm not aware of a stability measure (one may exist, I'm just not aware of it), but I suspect it's pretty subject dependent as well.
For the # of clusters, did you look at the graphs and use the elbow method to determine the optimal # of clusters?
And just as an FYI stability isn't always possible in a clustering model and you'll almost never get 100% stability with real data.
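
One widely used way to quantify the kind of overlap asked about above is the adjusted Rand index (ARI), which compares two partitions of the same rows and ignores how the clusters happen to be numbered. The sketch below is plain Python, not a SAS Miner node, and the function name and toy label lists are hypothetical. An ARI of 1 means the two runs produced the same grouping; values near 0 mean agreement no better than chance. The thread does not settle on a cutoff, and the 75% figure is the original poster's own proposal, so any threshold remains a judgment call.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same rows."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:  # fewer than two rows: nothing to compare
        return 1.0
    # Count pairs of rows that share a cluster in both runs, in run A, and in run B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / total_pairs  # agreement expected by chance
    max_index = (in_a + in_b) / 2
    if max_index == expected:  # degenerate case, e.g. both runs put everything in one cluster
        return 1.0
    return (both - expected) / (max_index - expected)

# Same grouping, different cluster numbers: perfect agreement.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]
print(adjusted_rand_index(run_1, run_2))  # 1.0

# A grouping that cuts across run_1 scores much lower.
run_3 = [0, 1, 0, 1, 0, 1]
print(adjusted_rand_index(run_1, run_3))
```

A stability check along these lines would rerun the clustering several times (different seeds or resampled data), compute the ARI for each pair of runs on the shared rows, and look at the spread of the scores.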
HLuffy
Calcite | Level 5
The number was determined by the ABC criterion. But my problem is not the number of clusters; it's that the clusters themselves change after every repetition. Is it the number of clusters that causes that? I know that I can't have 100% stability, but is there any heuristic rule or academic approach for deciding, above a certain percentage, that the clusters are stable?
HLuffy
Calcite | Level 5

HLuffy_0-1640904456754.png

Here's a picture of the selection of the # of clusters using ABC.

Reeza
Super User
Sorry, not super familiar with the output of EM Miner for Clustering. I definitely cannot interpret a graph without axes, context or titles.
HLuffy
Calcite | Level 5
I updated the graph, if that helps you understand it. ABC Statistics: displays the aligned box criterion statistics. The horizontal axis is the number of clusters and the vertical axis is the gap between the error measure from the reference data and the input data. The vertical line indicates the estimated number of clusters for the data.
Estimation Criterion: specifies the estimation criterion used in the aligned box criterion method. Global Peak Value uses the maximum peak value across all peak values in the gap statistics.
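
The description above (a gap between the error on reference data and on the input data, with the peak picking the number of clusters) can be sketched in plain Python as follows. This is an illustration under several assumptions, not the HP Cluster implementation: the error measure is taken to be the log of the within-cluster sum of squares, the reference data are drawn uniformly from an axis-aligned bounding box of the input (the classic gap-statistic setup, which may differ from how the aligned box is built in SAS), and all function names, parameters, and the toy data are hypothetical.

```python
import math
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wss(points, k, rng, n_iter=50):
    """Run Lloyd's k-means once and return the within-cluster sum of squares."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sqdist(p, centers[c]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return sum(min(sqdist(p, c) for c in centers) for p in points)

def gap_statistic(points, k_values, n_refs=5, seed=0):
    """Gap(k) = mean log-error on uniform reference data minus log-error on the input."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps = {}
    for k in k_values:
        log_w = math.log(kmeans_wss(points, k, rng))  # assumes a non-zero error
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: same size as the input, uniform over its bounding box.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        gaps[k] = sum(ref_logs) / n_refs - log_w
    return gaps

# Toy data (an assumption for illustration): three Gaussian blobs in 2-D.
data_rng = random.Random(1)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (6, 0), (3, 6)] for _ in range(40)]

gaps = gap_statistic(points, k_values=range(1, 7))
best_k = max(gaps, key=gaps.get)  # "global peak": the k with the largest gap
print(best_k, gaps)
```

Because each k-means run here uses a single random start, the gap values themselves vary with the seed, which is one reason the estimated number of clusters and the resulting clusters can both shift between runs.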
Reeza
Super User
Sorry, this is beyond my current recall for K-Means.