09-25-2015 02:21 PM - edited 09-25-2015 02:23 PM
I have a set of data that I want to apply a k-means clustering algorithm to, using PROC FASTCLUS.
Now, one of the issues that I face is that the dataset is relatively small; it consists of 25 individuals with 121 measurements per individual. In fact, it is EEG data from 25 individuals, so each measurement corresponds to the level of activity detected by a specific electrode in the array, and I am looking for emergent groupings of activity in the electrodes that may correspond to salient physiological regions of the brain.
As I am sure you are aware, clustering algorithms like PROC FASTCLUS are sensitive to the order of the data, especially in small samples. For now, I won't go into the reasons I prefer to use PROC FASTCLUS versus one of the other available methods; the question I face is independent of the clustering procedure anyway, but I am open to suggestions for other procedures that may mitigate or avoid the issue I'm about to describe.
Anyway, all that aside, my plan for finding clusters is to try to overcome the small sample size by repeatedly permuting the order of the input dataset and running the clustering procedure some arbitrarily large number of times. In theory, this will allow me to stabilize the cluster assignments by choosing the final assignment based on some percentage cutoff. For example, electrode 1 might be assigned to cluster 1 because it lands in that cluster in 80% of the permutations, or something like that.
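To be concrete, the resampling loop I have in mind looks roughly like this (the input dataset name eeg, the variable list meas1-meas25, and the cluster count are placeholders, so don't hold me to the details):

%macro permute_cluster(nruns=100, k=8);
    %do i = 1 %to &nruns;
        /* shuffle the row order with a random sort key */
        data shuffled;
            set eeg;
            _sortkey = ranuni(0);
        run;
        proc sort data=shuffled; by _sortkey; run;

        /* cluster the shuffled data and keep only the assignments */
        proc fastclus data=shuffled maxclusters=&k
                      out=run&i(keep=electrode cluster) noprint;
            var meas1-meas25;
        run;

        /* rename CLUSTER so the runs can later be merged side by side */
        proc datasets lib=work nolist;
            modify run&i;
            rename cluster = run&i;
        quit;
    %end;
%mend;

After the loop, the run1-run&nruns datasets would be sorted and merged by electrode to produce a table like the one below.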
The problem is that it isn't trivial to identify individual clusters across runs. Every time I permute the data and run PROC FASTCLUS, a given cluster may be given a completely different ID number. For example, I ran the procedure 5 times, and created a dataset where each row corresponds to an electrode and each column corresponds to a cluster assignment from a different run of the PROC. It looks a little like this:
ELECTRODE RUN1 RUN2 RUN3 RUN4 RUN5
electrode96 1 1 5 5 2
electrode115 2 5 1 7 3
electrode10 5 6 6 8 4
electrode101 5 6 6 8 4
electrode102 5 6 6 8 4
electrode103 5 2 6 8 4
(So, here, electrode96 is assigned to cluster 1 on the first two runs of the PROC, cluster 5 on the third and fourth, and cluster 2 on the fifth.)
So, you see, I would clearly consider the cluster containing electrodes 10, 101, and 102 to be consistent across runs. However, I can't figure out a way to identify such a cluster automatically, since the ID number assigned to that group of observations isn't consistent across runs. There is also the additional issue of cases like electrode103, which is MOSTLY grouped the same way across runs, but occasionally differs.
So, in this super-simplified example, I would want a way to identify that electrodes 10, 101, 102, and 103 form one cluster, whereas electrodes 96 and 115 belong to different clusters. I just can't think of an efficient way to do this; with a suitably small set of runs, it is easy enough through visual inspection or through a simple series of logicals in a DATA step or PROC SQL.
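For instance, with only these 5 runs, the kind of brute-force check I mean might look like this in PROC SQL (each comparison evaluates to 1 or 0, so the sum gives the fraction of runs in which a pair of electrodes shares a cluster ID):

proc sql;
    create table agreement as
    select a.electrode as elec1, b.electrode as elec2,
           ( (a.run1 = b.run1) + (a.run2 = b.run2) +
             (a.run3 = b.run3) + (a.run4 = b.run4) +
             (a.run5 = b.run5) ) / 5 as pct_agree
    from clusters as a, clusters as b
    where a.electrode < b.electrode;
quit;

But writing out one logical per run column obviously gets unwieldy once the runs number in the hundreds.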
However, with 121 electrodes and some arbitrarily large number of runs, I can't think of an easy way to check cluster agreement. A similar, but not identical, question was asked and answered here. It may be that a similar framework would work for this question, but I am having a difficult time seeing how to adapt that solution to this issue (that question is more concerned with checking agreement between non-exclusive clusters within a single run, whereas I am interested in agreement between exclusive clusters across runs). I suspect that there may be an easy way to do this with hash objects in the DATA step, but I don't have nearly enough experience with them to manage it.
Does anybody have any suggestions? Here is the table I gave above as a SAS dataset:
DATA clusters;
    infile datalines dsd;
    input electrode :$12. run1 run2 run3 run4 run5;
    datalines;
electrode96,1,1,5,5,2
electrode115,2,5,1,7,3
electrode10,5,6,6,8,4
electrode101,5,6,6,8,4
electrode102,5,6,6,8,4
electrode103,5,2,6,8,4
;
run;
In the end, I would want something approximating the following:
DATA final_assignments;
    infile datalines dsd;
    input electrode :$12. cluster;
    datalines;
electrode96,1
electrode115,2
electrode10,3
electrode101,3
electrode102,3
electrode103,3
;
run;
09-28-2015 12:54 PM
Seeing as nobody has responded, I am wondering if my problem/question is not clear enough? If you don't entirely understand what I am trying to accomplish, I can rewrite my question.
09-28-2015 03:04 PM
I'm not sure what you really mean by an arbitrarily large number of times in this context. A quick check with the PERM function says that, permuting just 8 of the 25 elements, if you ran 1 series per second and had started at midnight on 01/01/1960, the job would finish on 12/01/3341.
09-28-2015 03:28 PM
"Arbitrarily large number of times" means exactly what it says. As you note, it isn't feasible for me to calculate every single possible permutation of the elements. But that's true for almost all situations in which bootstrap or permutation resampling is done. "Arbitrarily large number of times" simply means that I plan on taking a few hundred or a few thousand resamples, whatever number makes me comfortable that I've adequately compensated for the sample size (or at least "adequately" in the sense of "to the best of my ability given the limitations of the data"). I wasn't specific because I don't think it is specifically relevant to my question; any suitably automated solution to this problem will be independent of whether I am checking agreement between 10, 50, or 100 cluster assignments.