k means clustering in SAS

jeremyyjm — Wed, 06 Jun 2018 19:54:24 GMT

After running k-means clustering with PROC FASTCLUS in SAS several times (k = 1 to 5), I found that k = 3 gives the number of clusters I want. My question is: to plot the clusters in a two-dimensional plot, I need some dimension reduction method, but which method should I use? What is the difference between PCA and CDA in this case? Someone please help! (I have attached the outdata3 file.)

Canonical discriminant analysis
proc candisc data=outdata3 out=clustcan ncan=2;
class cluster;
var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1
parpres paractv famconct;
run;

Principal component analysis
proc princomp data=outdata3 out=clustprin n=2;
var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1
parpres paractv famconct;
run;

proc sgplot data=clustcan;
scatter y=can2 x=can1/group=cluster;
run;

proc sgplot data=clustprin;
scatter y=prin2 x=prin1/group=cluster;
run;

Re: k means clustering in SAS

RalphAbbey — Tue, 13 Nov 2018 16:46:46 GMT

With regard to dimension reduction for the purpose of visualization, there isn't necessarily a correct or incorrect answer. You have identified two good techniques, but they do slightly different things, so the plots they produce need to be interpreted differently.

Canonical Discriminant Analysis uses the cluster variable and creates a projection that is based on the cluster labels that you have assigned. What this means is that CDA tries to find the linear combination of inputs that has the highest possible multiple correlation with the cluster groups. You can think of this as the "best" (given the metric used in CDA) projection of the data for the purpose of seeing what linear combination best separates the clusters.
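To make the dependence on the cluster labels concrete, here is a minimal sketch of the full path from clustering to the CDA plot. It assumes the unclustered input data set is called clusdata (a hypothetical name; the thread only shows outdata3) and that outdata3 is the OUT= data set from the k = 3 PROC FASTCLUS run, which is where the cluster variable comes from. The MAXITER value and the title are also my own choices, not something from the original post.

```sas
/* Assumption: clusdata is a hypothetical name for the unclustered input data. */
/* The OUT= data set gains the variables CLUSTER and DISTANCE.                 */
proc fastclus data=clusdata out=outdata3 maxclusters=3 maxiter=100;
   var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1
       parpres paractv famconct;
run;

/* CDA: the CLASS statement feeds the cluster labels into the projection. */
proc candisc data=outdata3 out=clustcan ncan=2;
   class cluster;
   var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1
       parpres paractv famconct;
run;

/* Plot the two canonical variables, colored by cluster. */
proc sgplot data=clustcan;
   title "Canonical variables (projection uses the cluster labels)";
   scatter x=can1 y=can2 / group=cluster;
run;
title;
```

With three clusters there are at most two canonical variables (the number of classes minus one), so this two-dimensional plot shows all of the between-cluster separation that CDA can find. If you rerun the clustering with a different k, the canonical variables change as well, because they are derived from the labels.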

Principal Component Analysis does not consider the cluster labels. This could be more useful if you want to see how the clustering looks in a lower dimension without using the cluster information to bias your projection. The projection of the data does not depend on how you cluster; instead it is the "best" with respect to the variance of the data, meaning that the first two components retain as much of the total variance as any two-dimensional linear projection can. You can therefore look at the data first, and then see how the cluster labels are distributed across the projected space.
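One way to see that the PCA projection is independent of the clustering is to plot the same scores twice, once without and once with the cluster labels. The points sit in the same positions in both plots and only the coloring changes. This is a sketch that assumes the outdata3 data set from the question; the titles are my own additions. Note that PROC PRINCOMP analyzes the correlation matrix by default, so the variables are effectively standardized (the COV option would use the covariance matrix instead).

```sas
/* PCA: no CLASS statement, so the cluster variable plays no role here. */
/* The OUT= data set keeps the original variables, including CLUSTER.   */
proc princomp data=outdata3 out=clustprin n=2;
   var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1
       parpres paractv famconct;
run;

/* Scores only: this is what the projection looks like on its own. */
proc sgplot data=clustprin;
   title "PCA scores, cluster labels ignored";
   scatter x=prin1 y=prin2;
run;

/* Same scores, with the cluster labels overlaid afterwards. */
proc sgplot data=clustprin;
   title "Same PCA scores, colored by cluster";
   scatter x=prin1 y=prin2 / group=cluster;
run;
title;
```

If the clusters overlap heavily in the second plot but separate cleanly in the CDA plot, that does not mean the clustering is bad; it can simply mean that the directions of greatest overall variance are not the directions that separate the clusters.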

Ultimately, the two dimension reduction methods answer slightly different questions, and what you're trying to do with the dimension reduction and plotting should inform which route you take.

I hope this helped!
