Fae
Obsidian | Level 7

 

I am running linkage clustering with Ward's method and noticed that the pseudo F statistic is huge, around 40,000-50,000. I know that bigger is better, but is such a huge number normal?

I also ran a canonical discriminant analysis (plot below). My manager is questioning why some segments are so spread out, e.g. segment 9, the light blue one. I explained that the plot is only a 2D projection of all the factors, so this is normal, but he's not buying it. How should I explain it to him?

 

 

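/* Ward's method; CCC and PSEUDO request the cubic clustering criterion
   and the pseudo F and pseudo t-squared statistics */
proc cluster data=myData method=ward ccc pseudo;
	var factor1-factor7;
	copy clus0;
run;

/* Cut the tree at 9 clusters and write the assignments to OUT */
proc tree noprint ncl=9 out=out;
	copy factor1-factor&numFactors. clus0;
run;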

 

 

 

Number of  Clusters Joined  Freq     Semipartial  R-Square  Approximate  Cubic       Pseudo F   Pseudo
Clusters                             R-Square               Expected     Clustering  Statistic  t-Squared
                                                            R-Square     Criterion
19         OB15     OB18    736      0.0035       0.708     0.572        764         4.00E+04   460
18         OB9      OB14    3861     0.0049       0.703     0.566        772         4.20E+04   1880
17         OB10     CL19    1521     0.0079       0.695     0.558        763         4.30E+04   852
16         OB1      OB16    6658     0.0087       0.687     0.551        753         4.40E+04   3047
15         OB4      OB13    11577    0.0142       0.673     0.542        712         4.40E+04   5594
14         CL16     OB5     8143     0.0143       0.658     0.533        677         4.40E+04   3017
13         CL18     CL17    5382     0.0143       0.644     0.523        648         4.50E+04   2088
12         OB12     OB20    18770    0.0167       0.627     0.512        613         4.60E+04   1.50E+04
11         OB8      OB17    67981    0.0168       0.61      0.5          586         4.70E+04   2.50E+04
10         CL12     OB19    41771    0.0195       0.591     0.486        553         4.80E+04   1.50E+04
9          OB7      OB11    15612    0.0204       0.571     0.47         529         5.00E+04   8577
8          CL11     CL10    109752   0.022        0.549     0.452        511         5.20E+04   1.70E+04
7          OB2      OB6     125900   0.0523       0.496     0.425        346         4.90E+04   1.00E+05
6          CL7      CL15    137477   0.059        0.437     0.39         212         4.60E+04   5.10E+04
5          CL14     CL8     117895   0.0746       0.362     0.345        74.2        4.30E+04   4.10E+04
4          CL6      CL9     153089   0.078        0.284     0.29         -21         4.00E+04   4.30E+04
3          CL5      CL4     270984   0.0813       0.203     0.216        -57         3.80E+04   3.40E+04
2          CL3      OB3     293984   0.1008       0.102     0.125        -118        3.40E+04   4.00E+04
1          CL2      CL13    299366   0.1023       0         0            0           .          3.40E+04

 

 

[Image: SG.png, the canonical discriminant plot referred to above]

1 ACCEPTED SOLUTION

PaigeMiller
Diamond | Level 26

The light blue segment is spread out because:

 

The data in the light blue segment is spread out.

 

Regarding the comment by @mkeintz, yes, rotating the view would certainly cause some colors to look more spread out or less spread out, but when you have 2 dimensions and you plot the results in 2 dimensions I'm not sure why rotation is needed or meaningful. But anyway, you can't get spread out data unless the data itself is spread out; if the data was really dispersed over a TINY area, no rotation would change that.

 

The pseudo F statistic is 40,000-50,000 because that's what the data is saying. Perhaps it's an outlier; perhaps the clusters really, really, really, really are distinct and separated. How can we say without your data?

--
Paige Miller
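
As a rough check on the size of that number, the pseudo F statistic can be recomputed from the cluster history in the question. This is a sketch, not part of the original replies. It assumes the usual definition of the pseudo F statistic in terms of R-Square, and it assumes n is the total frequency shown for the final join (299366). Here G is the number of clusters and n is the number of observations.

```latex
\[
F_{\text{pseudo}} = \frac{R^2/(G-1)}{(1-R^2)/(n-G)}
\]
% Values from the 9-cluster row of the cluster history; n = 299366 is assumed
\[
G = 9,\quad R^2 = 0.571,\quad n = 299366:\qquad
F_{\text{pseudo}} = \frac{0.571/8}{0.429/299357} \approx 4.98\times 10^{4}
\]
% Same R-Square with a hypothetical n = 300, for comparison only
\[
G = 9,\quad R^2 = 0.571,\quad n = 300:\qquad
F_{\text{pseudo}} = \frac{0.571/8}{0.429/291} \approx 48.4
\]
```

The first value agrees with the 5.00E+04 reported for 9 clusters. The second uses a hypothetical n = 300 purely for comparison: with the same R-Square the statistic would be about 48, so the 40000-50000 range here seems to reflect the sample size at least as much as the degree of separation.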


4 REPLIES
mkeintz
PROC Star

"He's not buying it?" Tell him that if you rotated the data into another dimension, all those clusters that look relatively compact in this plot may very well look quite dispersed from another angle, and the one in question may look more compact.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star
My point was to find a way to communicate to the manager that any 2-dimensional view of a set of colored dots in 3 or more dimensions can be distorted.

It may be that the 2 dimensions chosen are optimized to show the largest number of clusters as compact, possibly at the expense of the apparent compactness of a particular cluster.
PaigeMiller
Diamond | Level 26

@mkeintz wrote:



It may be that the 2 dimensions chosen are optimized to show the largest number of clusters as compact, possibly at the expense of the apparent compactness of a particular cluster.

Okay, but that's not what canonical correlation does.

--
Paige Miller
