arpitsharma27
Calcite | Level 5

Team,

 

I have this:

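%macro clustering_method(method=);
/* Hierarchical clustering of the sports cars; CCC adds _ccc_ to the OUTTREE= data set */
proc cluster data=cars method=&method. ccc outtree=tree_&method. noprint;
	where type="Sports";
	by type;
	var horsepower mpg_highway weight wheelbase;
run;
/* Keep only the rows that have a CCC value, ordered by number of clusters (_ncl_) */
proc sort data=tree_&method. out=&method.(keep= type _ncl_ _ccc_ );
	by type _ncl_ _ccc_ ;
	where not missing(_ccc_);
run;

%mend;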
%clustering_method(method=Average);
%clustering_method(method=median);
%clustering_method(method=centroid);
%clustering_method(method=mcquitty);
%clustering_method(method=ward);

data Have;
set Average 
	Median 
	Centroid 
	McQuitty 
	Ward 
		indsname=source;
input_ds=scan(source,2,'.'); /* source holds LIBREF.MEMBER, e.g. WORK.WARD; keep the member name */
run;

Referring to the Have dataset, what is the optimal number of clusters, and why?

 

Please advise.

 

Thanks

PaigeMiller
Diamond | Level 26

You are getting only negative numbers for CCC. This implies (to me) that there is no clustering. Also see https://www.researchgate.net/post/Could_someone_help_me_decide_the_ideal_noof_clusters_from_the_pseu...

which says

 

CCC is the cubic clustering criterion; the idea behind it is to compare the R squared you get with a specific number of clusters versus the R squared you would get by clustering a uniformly distributed set of points. That is, you interpret it similarly as you would R squared. You are getting STRICTLY negative values (and, in fact, they are decreasing with additional number of clusters before increasing again; I would interpret that increase as overfitting). This means that the model you are fitting to the data with X number of clusters fits worse than uniformly distributed points. This is evidence of a lack of clustering (or problems with the data). 

--
Paige Miller
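
One way to check the sign and shape of CCC across all five methods is to plot it against the number of clusters. This is only a sketch: it assumes the Have dataset from the question, with its input_ds, _ncl_ and _ccc_ columns; Have_sorted is a hypothetical name introduced here.

```sas
/* Sort so each method's line is drawn in order of cluster count */
proc sort data=Have out=Have_sorted;
  by input_ds _ncl_;
run;

/* One line per clustering method; the dashed line marks CCC = 0 */
proc sgplot data=Have_sorted;
  series x=_ncl_ y=_ccc_ / group=input_ds markers;
  refline 0 / axis=y lineattrs=(pattern=dash);
  xaxis label="Number of clusters" integer;
  yaxis label="Cubic clustering criterion";
run;
```

If every line stays below the reference line, the plot agrees with the reading above: none of the methods fits better than uniformly distributed points would.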
Ksharp
Super User

Choosing the number of clusters is an unsolved problem in general.

If I were you, I would try Principal Component Analysis.

Anyway, @Rick_SAS may have some ideas.
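
A minimal sketch of what the PCA suggestion could look like on the same four variables, assuming the cars dataset from the question; cars_pc is a hypothetical output name. PROC PRINCOMP works from the correlation matrix by default, so the variables are effectively standardized.

```sas
/* Principal components of the sports cars; keep the first two */
proc princomp data=cars(where=(type="Sports")) out=cars_pc n=2;
  var horsepower mpg_highway weight wheelbase;
run;

/* Scores on the first two components; visible groups or isolated points show up here */
proc sgplot data=cars_pc;
  scatter x=Prin1 y=Prin2;
run;
```

If the scatter shows a single cloud with no visible gaps, that would be consistent with the negative CCC values.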

arpitsharma27
Calcite | Level 5

Thank You @PaigeMiller  & @Ksharp 

I know _ccc_ is strictly negative.

 

I am already using PCA too.

The idea is to flag outliers with two different algorithms and then join the results to get the final output.

 

PCA was able to handle this, but KNN needs a better selection of variables. Just dumping variables into KNN and letting it figure out the clusters does not seem like the right approach.
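
As a possible starting point for screening the inputs, the sketch below looks at pairwise correlations and then reruns one clustering method on standardized variables. It assumes the cars dataset from the question resembles SASHELP.CARS, where weight is in the thousands and would dominate unscaled distances; tree_ward_std is a hypothetical output name.

```sas
/* Pairs with |r| close to 1 carry similar information and may be redundant */
proc corr data=cars(where=(type="Sports")) nosimple;
  var horsepower mpg_highway weight wheelbase;
run;

/* STD rescales each variable to mean 0 and standard deviation 1 before clustering */
proc cluster data=cars(where=(type="Sports")) method=ward std ccc
             outtree=tree_ward_std noprint;
  var horsepower mpg_highway weight wheelbase;
run;
```

Neither step picks variables automatically; they only show how much each input overlaps with the others and stop any single scale from driving the distances.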

Thank You to the Legends.

MelodieRush
SAS Employee

Aligned Box Criterion is available in the HP Cluster node in SAS Enterprise Miner. It will determine the optimum number of clusters. Here's a video that talks about using this option, along with using CCC and gap methods https://www.youtube.com/watch?v=NZpNTkfT47c
