What should be the optimum number of clusters?

arpitsharma27 · Posted 06-17-2020 01:26 PM

Team,

I have this:

%macro clustering_method(method=);
/* Hierarchical clustering of the sports cars; the CCC option requests the cubic clustering criterion. */
/* cars is presumably a WORK copy of sashelp.cars. */
proc cluster data=cars method=&method. ccc outtree=tree_&method. noprint;
	where type="Sports";
	by type;
	var horsepower mpg_highway weight wheelbase;
run;

/* Keep only the tree nodes that have a CCC value */
proc sort data=tree_&method. out=&method.(keep=type _ncl_ _ccc_);
	by type _ncl_ _ccc_;
	where not missing(_ccc_);
run;
%mend;

%clustering_method(method=Average);
%clustering_method(method=median);
%clustering_method(method=centroid);
%clustering_method(method=mcquitty);
%clustering_method(method=ward);

/* Stack the five result sets; input_ds records the method each row came from */
data Have;
	set Average
		Median
		Centroid
		McQuitty
		Ward
		indsname=source;
	input_ds=scan(source,2,'.');
run;

Looking at the Have data set, what should my optimum number of clusters be, and why?

Please advise.

Thanks

PaigeMiller · Posted 06-17-2020 02:28 PM

You are getting only negative numbers for CCC. This implies (to me) that there is no clustering. Also see https://www.researchgate.net/post/Could_someone_help_me_decide_the_ideal_noof_clusters_from_the_pseu...

which says

CCC is the cubic clustering criterion; the idea behind it is to compare the R squared you get with a specific number of clusters versus the R squared you would get by clustering a uniformly distributed set of points. That is, you interpret it similarly as you would R squared. You are getting STRICTLY negative values (and, in fact, they are decreasing with additional number of clusters before increasing again; I would interpret that increase as overfitting). This means that the model you are fitting to the data with X number of clusters fits worse than uniformly distributed points. This is evidence of a lack of clustering (or problems with the data).
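One way to check this against the Have data set from the question is to look at where CCC peaks for each method and whether that peak is positive at all. The following is a minimal sketch, not a definitive rule: the names ccc_sorted, ccc_peak, ncl_at_peak, peak_ccc and ccc_is_positive are made up for illustration, and using zero as the cut-off simply follows the "worse than uniformly distributed points" reading above.

```sas
/* Sketch: for each clustering method, find the number of clusters with the highest CCC. */
/* Assumes the Have data set built in the original post (input_ds, _ncl_, _ccc_).       */
proc sort data=Have out=ccc_sorted;
    by input_ds descending _ccc_;
run;

data ccc_peak;
    set ccc_sorted;
    by input_ds;
    if first.input_ds;               /* row with the highest CCC within each method */
    ccc_is_positive = (_ccc_ > 0);   /* 1 = above the uniform reference, 0 = not    */
    rename _ncl_ = ncl_at_peak _ccc_ = peak_ccc;
run;

proc print data=ccc_peak noobs;
    var input_ds ncl_at_peak peak_ccc ccc_is_positive;
run;
```

If ccc_is_positive is 0 for every method, the peak only marks the least bad solution, which matches the "no clustering" interpretation rather than pointing to an optimum number of clusters.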

--
Paige Miller

Ksharp · Posted 06-18-2020 08:19 AM

Choosing the optimal number of clusters is still an unsolved problem.

If I were you, I would try Principal Component Analysis.

Anyway, @Rick_SAS may have some ideas.
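As a rough illustration of the PCA suggestion, the variables from the original post could be reduced to a few principal components before clustering. This is only a sketch under assumptions: the data set names pc_scores and tree_pc are hypothetical, and keeping two components and using Ward's method are arbitrary choices, not recommendations from this thread.

```sas
/* Sketch: project the four variables onto the first two principal components. */
/* PROC PRINCOMP analyzes the correlation matrix by default.                   */
proc princomp data=cars out=pc_scores n=2 noprint;
    where type="Sports";
    var horsepower mpg_highway weight wheelbase;
run;

/* Cluster on the component scores Prin1 and Prin2 instead of the raw variables */
proc cluster data=pc_scores method=ward ccc outtree=tree_pc noprint;
    var prin1 prin2;
run;
```

The _ccc_ values in tree_pc can then be compared with those from the raw variables to see whether the picture changes.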

arpitsharma27 · Posted 06-18-2020 09:24 AM

Thank you, @PaigeMiller and @Ksharp.

I know _ccc_ is strictly negative.

I am already using PCA too.

The idea is to detect outliers with two different algorithms and then join the results to get the final output.

PCA was able to handle this, but KNN needs a better selection of variables. Just dumping variables into KNN and letting it figure out the clusters does not seem to be the correct thing to do.

Thank you to the legends.

MelodieRush · Posted 06-18-2020 09:24 AM

The Aligned Box Criterion is available in the HP Cluster node in SAS Enterprise Miner. It will determine the optimum number of clusters. Here's a video that talks about using this option, along with the CCC and gap methods: https://www.youtube.com/watch?v=NZpNTkfT47c

