jeremyyjm
Calcite | Level 5

After running k-means clustering with proc fastclus in SAS several times (k=1 to 5), I found that k=3 is the number of clusters I want. My question is: if I want to plot the clusters in a two-dimensional plot, I need some dimension reduction method, but which one should I use? What is the difference between PCA and CDA in this case? Could someone please help me? (I have attached the outdata3 file.)

 

/* Canonical discriminant analysis */
proc candisc data=outdata3 out=clustcan ncan=2;
class cluster;
var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1 
    parpres paractv famconct;
run;

/* Principal component analysis */
proc princomp data=outdata3 out=clustprin n=2;
var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1 
    parpres paractv famconct;
run;


proc sgplot data=clustcan;
scatter y=can2 x=can1/group=cluster;
run;
quit;


proc sgplot data=clustprin;
scatter y=prin2 x=prin1/group=cluster;
run;
quit;

RalphAbbey
SAS Employee

With regard to dimension reduction for the purpose of visualization, there isn't necessarily a correct or incorrect answer. You have identified two good techniques, but they do slightly different things, so the plots they produce need to be interpreted differently.

 

Canonical Discriminant Analysis will use the cluster variable and create a projection that is based upon the cluster labels that you have assigned. What this means is that CDA will try to find the linear combination of inputs that has the highest correlation with the cluster label. You can think of this as the "best" (given the metric used in CDA) projection of the data for the purpose of seeing what linear combination best separates the cluster labels.
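One rough way to see how strongly the canonical variables separate the clusters is to compare the cluster means on Can1 and Can2. This is only a sketch; it assumes the clustcan data set written by the proc candisc step in the question, with its cluster, Can1, and Can2 variables.

```sas
/* Sketch: per-cluster summary of the two canonical variables.          */
/* Assumes clustcan was created by the proc candisc step shown above.   */
proc means data=clustcan n mean std;
class cluster;
var can1 can2;
run;
```

If the cluster means are far apart relative to the standard deviations, the canonical projection is pulling the clusters apart, which is what it is designed to do.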

 

Principal Component Analysis will not consider the cluster labels. This could be more useful if you want to see how the clustering looks in a lower dimension without using the cluster information to bias your projection. The projection of the data does not depend on how you cluster; it is instead the "best" with respect to the variance of the data. You can look at the projected data first, and then see how the cluster labels are distributed across the projected space.
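Since the PCA projection is chosen for variance rather than for cluster separation, it can help to check how much of the total variance the first two components retain before reading too much into the plot. The sketch below assumes the outdata3 data set and the variable list from the question; it leaves out n=2 so that the full eigenvalue table is listed. Note that proc princomp analyzes the correlation matrix by default, so the variables are effectively standardized.

```sas
/* Sketch: show only the eigenvalue table from proc princomp.           */
/* Assumes outdata3 and the variable list used in the question.         */
ods select Eigenvalues;
proc princomp data=outdata3;
var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1 
    parpres paractv famconct;
run;
```

The Cumulative column in the second row of that table gives the share of variance captured by Prin1 and Prin2 together. If that share is small, clusters that look overlapping in the two-dimensional plot may still be well separated in the full variable space.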

 

Ultimately, the two dimension reduction methods answer slightly different questions, and what you're trying to do with the dimension reduction and plotting should inform which route you take.

 

I hope this helped!
