12-13-2012 10:39 AM
My questions concern PROC FASTCLUS, and I would appreciate any help that you can provide.
1) The name of the option MAXCLUSTERS = K suggests that K is the maximum number of clusters that PROC FASTCLUS will try, but the documentation suggests (without saying so definitively) that this option forms only K clusters.
- What exactly does this option do: up to K clusters, or only K clusters?
- If it only forms K clusters, is there a way to try multiple values of K without manually copying and pasting multiple PROC FASTCLUS statements?
2) If I don't use the SEED option to specify initial seeds, then how are the initial means set? The documentation has the following vague statement that doesn't answer my question:
"If you do not specify the SEED= option, initial seeds are selected from the DATA= data set."
3) The option RADIUS = t suggests that this is a convergence criterion - when the radius is sufficiently small, the algorithm stops.
I thought that there was only one convergence criterion in K-Means clustering: when the cluster assignments don't change from one iteration to the next.
a) Am I wrong about that?
b) If so, which convergence criterion works better? (I'm guessing that there is a grey answer, but show me the grey.)
c) How do I implement the best convergence criterion automatically (without manually checking multiple criteria) in PROC FASTCLUS? Is this even possible?
12-15-2012 04:46 PM
If your data, according to the specifications of the PROC FASTCLUS algorithm, contains fewer than K clusters, then PROC FASTCLUS will form and display only that smaller number of clusters. As an example, generate data in which only ten distinct patterns of variable values occur across 1,000 observations. If you specify MAXCLUSTERS=15, PROC FASTCLUS will find and display only ten clusters of observations, not 15.
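Here is a minimal sketch of that experiment. The data set name ten_patterns and the variables x and y are made up for illustration, and I have not pasted any output because the listing depends on your SAS release.

```sas
/* Hypothetical data: 1,000 observations, but only ten distinct (x, y) patterns */
data ten_patterns;
   do i = 1 to 1000;
      pattern = mod(i, 10) + 1;   /* cycles through the values 1 to 10 */
      x = pattern;
      y = 2 * pattern;
      output;
   end;
   drop i;
run;

/* Ask for up to 15 clusters; OUT= saves each observation's cluster assignment */
proc fastclus data=ten_patterns maxclusters=15 out=clus_out;
   var x y;
run;

/* Count how many distinct values of CLUSTER were actually assigned */
proc freq data=clus_out;
   tables cluster;
run;
```

With the default RADIUS=0, an exact duplicate of an existing seed should not qualify as a new seed, so the PROC FREQ table is expected to show no more than ten clusters.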
You can write a SAS macro to specify different values of MAXCLUSTERS.
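A sketch of such a macro follows. The macro name try_k, its parameters, and the data set and variable names in the call are all hypothetical; adjust them to your own data.

```sas
/* Run PROC FASTCLUS once for each K from kmin to kmax */
%macro try_k(data=, vars=, kmin=2, kmax=10);
   %local k;
   %do k = &kmin %to &kmax;
      /* One run per K; results go to clus_<K> and stat_<K> */
      proc fastclus data=&data maxclusters=&k maxiter=100
                    out=clus_&k outstat=stat_&k noprint;
         var &vars;
      run;
   %end;
%mend try_k;

/* Example call: try K = 2, ..., 8 on a hypothetical data set */
%try_k(data=mydata, vars=x y z, kmin=2, kmax=8)
```

Each stat_&k data set holds the summary statistics for that run, so you can stack them afterwards and compare the values of K side by side.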
If you do not specify a SEED= data set, PROC FASTCLUS selects observations from the input SAS data set as its initial cluster seeds. If the order of those observations is correlated with their variable values (for example, if the data set is sorted), then the clusters PROC FASTCLUS generates may not describe the range of variable values in the input data set very well. Rather than building a SEED= data set, some researchers recommend generating a uniformly distributed random number for each observation in the input data set and sorting the observations by this random number before running PROC FASTCLUS. This breaks up any arbitrary patterns in the original order of the observations, so that the observations chosen as initial cluster seeds better "represent" the pattern of variable values in the input data set.
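The random-sort approach can be sketched as follows. The data set mydata, the variables x, y, and z, the helper variable _u, and the random-number seed 12345 are placeholders, not anything from your data.

```sas
/* Attach a uniform random number to every observation */
data shuffled;
   set mydata;
   if _n_ = 1 then call streaminit(12345);   /* fixed seed so the shuffle is repeatable */
   _u = rand('uniform');
run;

/* Sorting by the random number puts the observations in random order */
proc sort data=shuffled;
   by _u;
run;

/* Initial seeds are now drawn from a randomly ordered data set */
proc fastclus data=shuffled(drop=_u) maxclusters=5 out=clus_out;
   var x y z;
run;
```

Changing the STREAMINIT seed and rerunning gives a different starting configuration, which is a cheap way to see how sensitive the final clusters are to the initial seeds. The PROC FASTCLUS documentation also describes a REPLACE=RANDOM option that may serve a similar purpose; I would check its exact behavior there before relying on it.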
The RADIUS option is less a convergence criterion than a criterion for selecting initial cluster seeds (observations): a candidate seed must be "far enough apart" from the previously selected initial cluster seeds. In that sense, this option is something of an alternative to specifying a value for the MAXCLUSTERS option.
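As a rough illustration, here is a run that specifies only RADIUS. The value 2.5 and the data set and variable names are arbitrary placeholders. My understanding of the documentation is that the number of clusters is then determined by how many observations are far enough apart to qualify as seeds, up to a default cap on MAXCLUSTERS (100, if I recall correctly; please verify for your release).

```sas
/* No MAXCLUSTERS=: the seed-separation rule decides how many clusters form */
proc fastclus data=mydata radius=2.5 outseed=seeds_out out=clus_out;
   var x y z;
run;

/* Each row of the OUTSEED= data set describes one cluster that was formed */
proc print data=seeds_out;
run;
```

A larger RADIUS should generally produce fewer, more widely separated seeds, and a smaller RADIUS should produce more.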
I don't know of a "best" convergence criterion for PROC FASTCLUS. I suggest reading the SAS/STAT documentation chapter "Introduction to Clustering Procedures", which describes how the different SAS clustering procedures work and what their strengths and weaknesses are. This may help you decide whether PROC FASTCLUS is the most appropriate procedure for your applications. The PROC FASTCLUS documentation also gives hints on how to improve the separation among the clusters it finds.
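For what it is worth, the options that control when iteration stops are MAXITER= and CONVERGE=. The sketch below reflects my reading of the documentation rather than a recommendation, and the data set and variable names are placeholders. As I understand it, CONVERGE= is a threshold on the relative change in the cluster seeds between iterations, so CONVERGE=0 with a generous MAXITER= comes closest to "iterate until nothing changes".

```sas
/* Allow up to 100 iterations and stop only when the seeds no longer move */
proc fastclus data=mydata maxclusters=5 maxiter=100 converge=0 out=clus_out;
   var x y z;
run;
```

The default number of iterations is small, so it is worth setting MAXITER= explicitly and checking the iteration history in the output to see whether the run actually converged.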
12-19-2012 05:03 PM
Thanks for your detailed reply, 1zmm. This was very helpful!
4) I struggle to see how RADIUS is an alternative to MAXCLUSTERS. (Indeed, PROC FASTCLUS only needs one of these options to be specified, and I don't understand that.) If I don't specify how many clusters I want, then how does PROC FASTCLUS decide how many clusters to make?
5) In your statement,
"If your data, according to the specifications of the PROC FASTCLUS algorithm, contains fewer than K clusters......",
can you give an example of the specifications that can lead to some J clusters, where J < K, even though n > K, where n is the number of observations?
Here is my guess: If I specify RADIUS such that it is very big, and my data are very concentrated, then I may get J = 10 clusters, even though I specified MAXCLUSTERS = 15.
Is my guess correct?
6) I want to directly set my own initial means. Is there a way to do that? (Once I figure out how to do that, then I can write a macro to try different initial means and compare their results.)
Thanks again for your help!