I find that clustering poses two main challenges:

1. One acts on a sample, and this has monumental consequences. Two different samples share no points with probability close to 1, so you can never claim replicability in clustering if you follow any of the many extant algorithms that proceed sequentially. Only if you act on "central points", which are actually local means, can you claim replicability.

2. Sequential methods do reach a solution, of course, but you never know how far that solution is from the optimal one.

Going parallel has two advantages:

1. You find "central points", that is, points surrounded by many others, so they do not move during iterations. Central points are local means, each with a surrounding subsample, aka a cluster. This makes their standard error much smaller than the standard deviation that measures the variability of single points. So if you follow the "any point is good" approach, where all points are equivalent, you are exposed to the full variability sigma, while if you act on central points, actually local means, you face a much smaller variability, a fraction of the sample standard deviation. That is why central points remain stable across iterations.

2. You avoid the worst of the sequential problems: the solution varying with the starting point. Because sequential methods act on points with high variability, the first point decides which solution you end up with.

In my opinion, the method should be parallel.
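The standard-error claim above can be checked numerically. Here is a minimal sketch (my own illustration, not the poster's algorithm): points drawn with standard deviation sigma, grouped into hypothetical neighborhoods of M points each; the resulting local means should vary roughly as sigma / sqrt(M).

```python
import random
import statistics

random.seed(0)

SIGMA = 1.0    # point-level standard deviation
M = 25         # assumed neighborhood (subsample) size per local mean
N_MEANS = 400  # number of local means to simulate

# Single points vary with standard deviation SIGMA;
# a local mean over M points varies with roughly SIGMA / sqrt(M).
points = [random.gauss(0.0, SIGMA) for _ in range(N_MEANS * M)]
local_means = [
    statistics.fmean(points[i * M:(i + 1) * M]) for i in range(N_MEANS)
]

sd_points = statistics.stdev(points)       # close to SIGMA = 1.0
se_means = statistics.stdev(local_means)   # close to SIGMA / 5 = 0.2

print(f"std of single points: {sd_points:.3f}")
print(f"std of local means  : {se_means:.3f}")
```

With M = 25 the local means fluctuate about five times less than individual points, which is the sense in which "central points" can stay put across iterations while single points cannot.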