DuaneTiemann
Fluorite | Level 6

I would like to discover clusters of simple line plots. I ran PROC CORR on the plots and subtracted each correlation from 1 to get a "distance" between each pair of plots.
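For reference, the distance step described above can be sketched roughly as follows. This is only an illustration: it assumes a hypothetical wide data set work.plots with one numeric column per plot, and the data set names work.corrOut and work.dist are placeholders, not the ones used in this thread.

```sas
/* Hypothetical input: work.plots, one numeric column per plot */
proc corr data=work.plots outp=work.corrOut noprint;
   var _numeric_;
run;

/* Keep only the correlation rows and turn each r into 1 - r */
data work.dist(type=distance);
   set work.corrOut(where=(_type_='CORR'));
   array c{*} _numeric_;      /* the correlation columns, in VAR order */
   do i = 1 to dim(c);
      if not missing(c{i}) then c{i} = 1 - c{i};
   end;
   drop i _type_;
run;
```

In this layout the rows (named by _NAME_) and the columns come out in the same VAR order, and each diagonal entry is 1 - 1 = 0. Pairs with no overlapping X values stay missing and would need separate handling.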

 

I was surprised to see that PROC CLUSTER did not always put the closest plots together in low-level clusters, with any of the methods that I tried. I expect this is because CLUSTER treats each column as a coordinate in an n-dimensional space, i.e. it does not use the distance derived from CORR between two plots as the distance, and it doesn't know that the column names match the ID variable values. I tried TYPE=DISTANCE as well with no success, though I can't claim to understand how distances are treated differently from coordinates.

 

The X axis range varies from plot to plot, so the overlap between plots is inconsistent. That may be what allows two highly correlated plots to have quite different correlations with less related plots.

 

I was hoping to find small clusters of the most correlated plots that then comprise larger clusters, and so on. Is there a way to do that? Or do I need to code it myself using the agglomerate paradigm? Or am I doing something dumb?

I'm no expert at clustering so I wouldn't be surprised to find I have a conceptual issue.

 

Note that CORR reports VADAX and MVCAX as the second most correlated pair of plots, yet they do not form a low-level cluster.

 

FWIW SAS 3.5 University Edition

 

Thanks, Duane

2 REPLIES
gergely_batho
SAS Employee

Your assumptions are correct: by default, PROC CLUSTER "treats each column sort of as a position in an 'n space' dimension".

And yes, you have to use type=distance to change this behavior.

The trick is that when you use id LeftName; only the rows of the distance matrix are identified.

The column names are ignored! In the distance matrix, the columns must be in the same order as the rows.
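One way to confirm that alignment, once a square matrix such as the lib.Corr1T built in the code further down exists, is to compare each row's ID value with the name of the column in the same position. This is only a sketch. It assumes LeftName is a character variable, that the only numeric columns are the distances, and that the ID values are valid SAS names (so the transposed column names equal them). colName is a helper variable introduced for illustration.

```sas
data _null_;
   set lib.Corr1T;
   array d{*} _numeric_;         /* assumed: only the distance columns are numeric */
   length colName $32;           /* hypothetical helper variable */
   if _n_ > dim(d) then
      put 'More rows than distance columns at row ' _n_;
   else do;
      colName = vname(d{_n_});   /* name of the column in this row's position */
      if upcase(colName) ne upcase(strip(LeftName)) then
         put 'Mismatch at row ' _n_ ': row is ' LeftName 'but column is ' colName;
   end;
run;
```

If nothing is written to the log, row i and column i refer to the same plot for every i.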

 

Code solution:

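/* Sort so that every LeftName group lists its RightName values in the same order */
proc sort data=lib.Corr1;
  by leftName rightName;
run;

/* Reshape to a square matrix: one row per LeftName, one column per RightName */
proc transpose data=lib.Corr1
               out=lib.Corr1T;
  by leftName;
  id rightName;
run;

/* Now all the 0s are on the diagonal of the distance matrix */
/* type=distance: read the values as distances, not as coordinates */
proc cluster data=lib.Corr1T(type=distance)
             outtree=lib.ClusterTree
             method=average nosquare;
  id LeftName;
run;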
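
To read cluster memberships off the resulting tree at a chosen level, one option is PROC TREE on the OUTTREE= data set. This is a minimal sketch: the cluster count of 4 and the output data set name lib.ClusterMembers are arbitrary placeholders, not values taken from this thread.

```sas
/* Cut the tree from PROC CLUSTER into a chosen number of clusters */
proc tree data=lib.ClusterTree
          nclusters=4                /* placeholder value; adjust as needed */
          out=lib.ClusterMembers     /* hypothetical output data set name */
          noprint;
run;

/* List which plot ended up in which cluster */
proc sort data=lib.ClusterMembers;
  by cluster;
run;

proc print data=lib.ClusterMembers;
run;
```

Rerunning with a larger or smaller NCLUSTERS= value shows how the small clusters merge into larger ones.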

DuaneTiemann
Fluorite | Level 6

Thanks a lot.  That's very helpful.  Duane
