RyanSimmons
Pyrite | Level 9

Hello,

 

I have a set of data that I want to apply a k-means clustering algorithm to, using PROC FASTCLUS.

 

Now, one of the issues that I face is that the dataset is relatively small; it consists of 25 individuals with 121 measurements per individual. In fact, it is EEG data from 25 individuals, so each measurement corresponds to the level of activity detected by a specific electrode in the array, and I am looking for emergent groupings of activity in the electrodes that may correspond to salient physiological regions of the brain.
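
(For what it's worth, the basic call I have in mind is something along these lines; the dataset name, variable names, and number of clusters below are placeholders rather than my actual setup, with each row being an electrode and the 25 individuals supplying the variables.)

/* placeholder sketch: rows = electrodes, subj1-subj25 = the 25 individuals, k = 8 is arbitrary */
PROC FASTCLUS data=eeg maxclusters=8 out=assignments;
var subj1-subj25;
id electrode;
run;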

 

As I am sure you are aware, clustering algorithms like PROC FASTCLUS are sensitive to the order of the data, especially in small samples. For now, I won't go into the reasons I prefer to use PROC FASTCLUS versus one of the other available methods; the question I face is independent of the clustering procedure anyway, but I am open to suggestions for other procedures that may mitigate or avoid the issue I'm about to describe.

 

Anyway, all that aside, my plan for finding clusters is to try to overcome the small sample size by repeatedly permuting the order of the input dataset and running the clustering procedure some arbitrarily large number of times. In theory, this will allow me to stabilize the cluster assignments by choosing the final assignment based on some percentage cutoff. For example, electrode 1 may be assigned to cluster 1 because it is assigned to that cluster in 80% of the permutations, or something like that.
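
In code terms, the loop I have in mind looks roughly like the sketch below (again, the dataset name, variable names, number of clusters, and number of runs are placeholders, not a finished implementation):

%macro permute_and_cluster(nruns=1000, k=8);
%do i = 1 %to &nruns;

  /* shuffle the row order with a random sort key */
  data shuffled;
    set eeg;
    if _n_ = 1 then call streaminit(&i); /* a different, reproducible seed per run */
    _key = rand('uniform');
  run;
  proc sort data=shuffled;
    by _key;
  run;

  /* cluster the permuted data, keeping only the electrode ID and its assignment */
  proc fastclus data=shuffled maxclusters=&k noprint
                out=run&i(keep=electrode cluster rename=(cluster=run&i));
    var subj1-subj25;
    id electrode;
  run;

  proc sort data=run&i;
    by electrode;
  run;

%end;

/* one row per electrode, one column of assignments per run */
data all_runs;
  merge %do i = 1 %to &nruns; run&i %end; ;
  by electrode;
run;
%mend permute_and_cluster;

With 5 runs, the all_runs dataset at the end would have the same shape as the toy table below.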

 

The problem is that it isn't trivial to identify individual clusters across runs. Every time I permute the data and run PROC FASTCLUS, a given cluster may be given a completely different ID number. For example, I ran the procedure 5 times, and created a dataset where each row corresponds to an electrode and each column corresponds to a cluster assignment from a different run of the PROC. It looks a little like this:

 

ELECTRODE      RUN1    RUN2    RUN3    RUN4    RUN5
electrode96       1       1       5       5       2
electrode115      2       5       1       7       3
electrode10       5       6       6       8       4
electrode101      5       6       6       8       4
electrode102      5       6       6       8       4
electrode103      5       2       6       8       4

(So, here, electrode96 is assigned to cluster 1 on the first two runs of the PROC, cluster 5 on the third and fourth, and cluster 2 on the fifth.)

 

So, you see, I would clearly consider the cluster with electrodes 10, 101, and 102 to be consistent across runs. However, I can't figure out a way to automatically identify such a cluster, since the ID number assigned to that group of observations isn't consistent across runs. Similarly, there is the additional issue of examples like electrode103, which is MOSTLY the same across runs, but is occasionally different.

 

So, in this super simplified example, I would want a way to identify that electrodes 10, 101, 102, and 103 are one cluster, whereas electrodes 96 and 115 are different clusters. I just can't think of an efficient way to do this; in a suitably small set of runs, it is easy enough to do through visual inspection or through a simple series of logicals in a DATA step or PROC SQL.

 

However, with 121 electrodes and some arbitrarily large number of runs, I can't think of an easy way to check cluster agreement. A similar, but not identical, question was asked and answered here. It may be that a similar framework would work for this question, but I am having a difficult time determining how to adapt that solution to this issue (that question is more concerned with checking agreement between non-exclusive clusters within a single run, whereas I am interested in agreement between exclusive clusters across runs). I suspect that there may be an easy way to do this using the hash object in the DATA step, but I don't have nearly enough experience with it to manipulate it properly.

 

Does anybody have any suggestions? Here is the table I gave above as a SAS dataset:

 

DATA clusters;
infile datalines dsd;
input electrode :$12. run1 run2 run3 run4 run5;
datalines;
electrode96,1,1,5,5,2
electrode115,2,5,1,7,3
electrode10,5,6,6,8,4
electrode101,5,6,6,8,4
electrode102,5,6,6,8,4
electrode103,5,2,6,8,4
;
run;
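
One direction I have considered is to sidestep the label switching entirely by counting, for every pair of electrodes, how often they land in the same cluster across runs, and then linking pairs that agree at least some percentage of the time. A rough sketch on the toy dataset above (the 80% cutoff is just an example):

/* reshape to one row per electrode per run */
proc transpose data=clusters out=long(rename=(col1=cluster)) name=run;
by electrode notsorted;
var run1-run5;
run;

proc sql;
/* for every pair of electrodes, the share of runs in which they share a cluster */
create table agreement as
select a.electrode as elec1,
       b.electrode as elec2,
       mean(a.cluster = b.cluster) as pct_same
from long as a
     inner join long as b
       on a.run = b.run and a.electrode < b.electrode
group by a.electrode, b.electrode;

/* keep only the pairs that co-cluster in at least 80% of runs */
create table linked as
select *
from agreement
where pct_same >= 0.8;
quit;

From there I imagine the final cluster IDs could be read off by treating linked as an edge list and grouping electrodes that are connected through it, but that last step is exactly the part I can't see how to automate cleanly.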

In the end, I would want something approximating the following:

 

DATA final_assignments;
infile datalines dsd;
input electrode :$12. cluster;
datalines;
electrode96,1
electrode115,2
electrode10,3
electrode101,3
electrode102,3
electrode103,3
;
run;

 

RyanSimmons
Pyrite | Level 9

Seeing as nobody has responded, I am wondering if my problem/question is not clear enough. If you don't entirely understand what I am trying to accomplish, I can rewrite my question.

ChrisHemedinger
Community Manager
Ryan, I moved your post to the Statistical Procedures board. That might generate more response.
ballardw
Super User

I'm not sure what you really mean by an arbitrarily large number of times in this context. A quick check with the PERM function says that, just permuting 8 of the 25 elements, if you ran 1 series per second and had started at midnight on 01/01/1960, the job would finish on 12/01/3341.
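
(If anyone wants to reproduce that back-of-the-envelope figure, here is a quick check using the 1-permutation-per-second rate assumed above:)

data _null_;
n_orderings = perm(25, 8);                      /* ordered arrangements of 8 out of 25 items */
finish = '01JAN1960:00:00:00'dt + n_orderings;  /* SAS datetime values count seconds from 01JAN1960 */
put n_orderings= comma21. finish= datetime20.;
run;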

RyanSimmons
Pyrite | Level 9

"Arbitrarily large number of times" means exactly what it means. As you say, it isn't feasible for me to calculate very single possible permutation of the elements. But that's true for almost all situations in which bootstrap or permuation resampling are done. "Arbitrarily large number of times" simply means that I plan on taking a few hundred or a few thousand resamples, whatever number would make me feel comfortable that I've adequately compensated for the sample size (or, at least, adequate in the sense of "to the best of my ability given the limitations of the data"). I wasn't specific because I don't feel that it is specifically relevant to my question; any suitably automated solution to this problem will be independent of whether or not I am checking agreement between 10, 50, or 100 cluster assignments.

