MrTh
Obsidian | Level 7

Hi community

It's probably simple for most of you, but here goes:

I have a file of 800 observations (rows) x 4 columns. The columns come from a PCA run in external software, which gives me the first 4 eigenvectors (components) for each observation. I plot the first 3 eigenvectors using PROC G3D and get the attached plot, which is what I want: it shows some clusters of observations.

My question is that I'd like to identify which observations belong to which cluster. I tried PROC FASTCLUS, which gives something close to what the graph shows but differs for 1/3 of the observations.

I was wondering if there is some clever trick to do something like that. I was thinking along the lines of the k-means algorithm.

 

By identifying which observations belong to which cluster, I mean in an automated way, not with rules like if eig1 > x and eig2 < y and ... 🙂

Thanks

 

1 ACCEPTED SOLUTION

ballardw
Super User

@MrTh wrote:

My apologies

I have attached the SAS file and the simple code is below.

The problem I see is that in the last step (PROC FREQ) only 47 TX observations are in cluster 1 and the rest are in cluster 2 with CL. That should not be.

 


Please describe why "that should not be". Are you expecting each of your Code values to be in a single cluster?

If you run Proc Means on your data by Code you will see that the STD for Eig3 in the TX code is larger, by an order of magnitude, than in the rest of the data. Similarly, Eig3 in the TX code has the widest range of values. I suspect the 47 observations that make up Cluster 1 are separating out because, within that cluster, the values in the Eig3 dimension are much larger than for the other variables. That is the vertical stringing you see in the graph you showed.
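
A minimal sketch of that check, assuming the data set out.sas0 and the variables code, eig1, eig2 and eig3 from the poster's code. The choice of statistics and the MAXDEC= value are illustrative assumptions, and CLASS is used in place of BY so the data do not need to be sorted first:

```sas
/* Sketch: spread of each eigenvector within each code group.       */
/* Assumes out.sas0 holds code, eig1, eig2, eig3 (poster's names).  */
proc means data=out.sas0 n mean std min max range maxdec=4;
  class code;            /* one block of statistics per code value  */
  var eig1 eig2 eig3;    /* compare STD and RANGE across variables  */
run;
```

Comparing the STD and Range columns for eig3 across the code groups should show whether the TX group is the one with the unusually wide spread.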

 

You may want to consider using Proc STDIZE to standardize the values, but I don't have a suggestion as to which options may "help". But if you can get a transform that reduces that Eig3 spread you may get the result you expect.
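
One way this could be tried, as a sketch only: standardize the three variables and re-run the same clustering on the result. METHOD=STD (subtract the mean, divide by the standard deviation) is an assumed starting point, not a recommendation from this thread, and the output data set names sas0std and sas0stdclus are hypothetical:

```sas
/* Sketch: standardize eig1-eig3, then repeat the poster's clustering. */
/* METHOD=STD is an assumption; other METHOD= values may suit better.  */
proc stdize data=out.sas0 out=sas0std method=std;
  var eig1 eig2 eig3;
run;

/* Same FASTCLUS settings as the original run, on standardized data */
proc fastclus data=sas0std out=sas0stdclus maxclusters=3 maxiter=100 converge=0;
  id code;
  var eig1 eig2 eig3;
run;

/* Compare known codes against the new cluster assignments */
proc freq data=sas0stdclus;
  tables code*cluster / nocol norow nopercent;
run;
```

Whether this brings the clusters closer to what the plot suggests depends on the data, so the cross-tabulation should be compared with the one from the unstandardized run.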

 

You may want to look at what happens when you use Maxclusters = 5. Your TX code gets split into 3 clusters, as the image suggests, that do not overlap with the CL and SU.
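
A sketch of that run, reusing the poster's settings with only MAXCLUSTERS changed. The output data set name sas0clus5 is hypothetical:

```sas
/* Sketch: same clustering as the original, but allowing 5 clusters. */
proc fastclus data=out.sas0 out=sas0clus5 maxclusters=5 maxiter=100 converge=0;
  id code;
  var eig1 eig2 eig3;
run;

/* Check how each code is distributed over the 5 clusters */
proc freq data=sas0clus5;
  tables code*cluster / nocol norow nopercent;
run;
```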


5 REPLIES
ballardw
Super User

As a minimum I suggest providing the Proc Fastclus code that you used. Better would be to provide your data.

 

 

MrTh
Obsidian | Level 7

My apologies

I have attached the SAS file and the simple code is below.

The problem I see is that in the last step (PROC FREQ) only 47 TX observations are in cluster 1 and the rest are in cluster 2 with CL. That should not be.

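/* Keep only the ID, the code label, and the first three eigenvectors */
data out.sas0 (keep = ID code eig1 eig2 eig3) ;
  set eig1 ;
run ;

/* 3-D scatter plot of the first three eigenvectors */
proc g3d data=out.sas0 ;
  scatter eig1*eig2=eig3 / noneedle grid size=0.8 tilt=70 ;
run ; quit ;

/* Cluster on the three eigenvectors, allowing at most 3 clusters */
proc fastclus data=out.sas0 out=sas0clus maxclusters=3 maxiter=100 converge=0 ;
  id code ;
  var eig1 eig2 eig3 ;
run ;

/* Cross-tabulate the known code against the assigned cluster */
proc freq data=sas0clus ;
  tables code * cluster / nocol norow nopercent ;
run ;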

 


PaigeMiller
Diamond | Level 26

I find it difficult to believe that any canned algorithm will exactly match your eye's detection of clusters* from a plot that can be seen at arbitrary rotations. 

 

If you really want to do this, you might try JMP software (gasp!). There you can plot and rotate points in three dimensions and then, with your mouse, select the points in a cluster detected by your eyes. The selected points are highlighted in your data table, and you can then filter out the non-highlighted data points (or, alternatively, choose a specific color for the points in the cluster).

 

* I suppose if the clusters are extremely well separated in the data, then clustering algorithms ought to find them just as well as your eyes can. But otherwise ...

 

--
Paige Miller
MrTh
Obsidian | Level 7

Fair enough, I accept that. I hadn't considered JMP as I have never used it. I might give it a go. Many thanks, Paige.
