s_manoj
Quartz | Level 8

Hi all,

I have been working on a data set with more than 60k observations, trying to find outliers with k-means clustering, but I am not able to interpret the results. Can someone guide me on how to interpret the output and how to detect outliers from it?

thanks in advance,

Regards

manoj

PeterClemmensen
Tourmaline | Level 20

What does your code look like?

s_manoj
Quartz | Level 8

proc fastclus data = sample out = name converge = 0 maxclusters = 25 maxiter = 50;

var income sales region state cars place  manufacturer;

run;

ballardw
Super User

From the documentation on Fastclus:

The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables.

 

From the names of your variables, I doubt that region, state, place, or manufacturer are quantitative variables; they look categorical, which means they are likely to be more useful as ID variables in FASTCLUS.

If Cars is something like the number of cars sold, purchased, or registered, then it is likely a VAR variable.

You should also consider standardizing the variables; otherwise the variable with the largest overall variation is likely to dominate the cluster assignment.
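As a rough sketch of what that could look like, assuming income, sales, and cars are the numeric variables in your SAMPLE data set (I am guessing from the names; SAMPLE_STD and CLUS are made-up data set names):

```sas
/* Sketch only: which variables are numeric is an assumption.        */
/* METHOD=STD centers each variable at its mean and scales it by     */
/* its standard deviation, so no single variable dominates.          */
proc stdize data=sample out=sample_std method=std;
   var income sales cars;
run;

/* Cluster on the standardized numeric variables only.               */
/* A categorical variable can be carried along as the ID variable.   */
proc fastclus data=sample_std out=clus maxclusters=25 maxiter=50;
   var income sales cars;
   id state;
run;
```

The OUT= data set keeps all the original variables and adds CLUSTER and DISTANCE, so the categorical columns are still available for describing the clusters afterwards.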

 

The example in the documentation on outliers using PROC FASTCLUS shows a three-step process: first get some seeds (starting cluster centers), then remove candidate clusters with few members, then compute the distance of each observation from its cluster seed. A final run then reassigns the observations to a final cluster (a step you likely don't want).
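A minimal sketch of that sequence, not the documentation's exact code. It assumes a standardized copy of your data named SAMPLE_STD with numeric variables income, sales, and cars; the data set names, the frequency cutoff of 5, and the STRICT= value of 3.0 are all placeholders you would need to tune:

```sas
/* Step 1: preliminary pass with many clusters to get candidate seeds. */
/* MAXITER=0 keeps the initial seeds; OUTSEED= stores them with _FREQ_. */
proc fastclus data=sample_std outseed=seeds1 maxclusters=25 maxiter=0 summary;
   var income sales cars;
run;

/* Step 2: drop seeds of clusters with very few members.               */
/* The cutoff of 5 is an assumption, not a rule.                       */
data seeds2;
   set seeds1;
   if _freq_ > 5;
run;

/* Step 3: recluster starting from the retained seeds. With STRICT=,   */
/* an observation farther than 3.0 from its nearest seed is not        */
/* assigned and gets a negative cluster number in the OUT= data set.   */
proc fastclus data=sample_std seed=seeds2 maxclusters=25 maxiter=50
              strict=3.0 out=scored outseed=seeds3;
   var income sales cars;
run;

/* Candidate outliers: negative CLUSTER values. The lower bound        */
/* excludes observations whose CLUSTER is missing.                     */
data outliers;
   set scored;
   if . < cluster < 0;
run;
```

The DISTANCE variable in the OUT= data set gives each observation's distance to its nearest seed, so sorting SCORED by descending DISTANCE is another way to review the most extreme observations.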

 

http://documentation.sas.com/?cdcId=statcdc&cdcVersion=14.2&docsetId=statug&docsetTarget=statug_fast...

s_manoj
Quartz | Level 8

@ballardw thank you for sharing the link. I have gone through the procedure described there. In one section, after deleting the clusters with very low frequency, they create only two clusters for detecting outliers, based on the previous seed analysis. I just want to know whether taking only two clusters is a rule of thumb or whether there is some other reason.

My data set actually contains ~70K obs with 6 vars, so in this case how many clusters should I consider?

Capture1.JPG

 

Regards

S Manoj
