Programming the statistical procedures from SAS

k means clustering

Contributor
Posts: 25

Hi all,

I have been working on a data set with more than 60k observations, trying to find outliers with k-means clustering, but I am not able to interpret the results. Can someone guide me on how to interpret the output and how to detect outliers from it?

thanks in advance,

Regards

manoj

PROC Star
Posts: 1,269

Re: k means clustering

What does your code look like?

Contributor
Posts: 25

Re: k means clustering

proc fastclus data=sample out=name converge=0 maxclusters=25 maxiter=50;
   var income sales region state cars place manufacturer;
run;

Super User
Posts: 13,508

Re: k means clustering

From the documentation on Fastclus:

The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables.

 

Judging from the names of your variables, I doubt that region, state, place, or manufacturer are quantitative; they look categorical, which means they are likely to be more useful as ID variables in FASTCLUS than as VAR variables.

If cars is something like the number of cars sold, purchased, or registered, then it is likely a VAR variable.

You should also consider standardizing the variables; otherwise the variable with the largest variance is likely to dominate the cluster assignment.
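As a minimal sketch of that standardization step: this assumes income, sales, and cars are the quantitative variables (a guess based on their names), and the data set names sample_std and clus are made up for illustration.

```sas
/* Assumption: income, sales and cars are quantitative.          */
/* sample_std and clus are hypothetical output data set names.   */

/* Rescale each variable to mean 0 and standard deviation 1 */
proc stdize data=sample out=sample_std method=std;
   var income sales cars;
run;

/* Cluster on the standardized variables only */
proc fastclus data=sample_std out=clus maxclusters=25 maxiter=50;
   var income sales cars;
run;
```

The categorical variables are left out of the VAR statement, but they should still be carried along in the OUT= data set, so each observation can be identified afterwards.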

 

The example on outliers in the PROC FASTCLUS documentation shows a process in several steps: first get some seeds (starting cluster centers), then remove candidate clusters with few members, then get the distance of each observation from its cluster seed, and finally reassign the observations to a final cluster (a step you likely don't want).
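The steps above can be sketched roughly as follows. This is not a copy of the documentation example: the number of preliminary clusters, the frequency cutoff, the STRICT= distance, and the data set names (sample_std, seeds1, seeds2, clus, outliers) are all illustrative assumptions, and sample_std is assumed to be a standardized copy of the data.

```sas
/* Step 1: preliminary pass with many clusters. MAXITER=0 skips the   */
/* iterative recomputation of seeds; OUTSEED= saves one row per       */
/* cluster, including its frequency (_FREQ_).                         */
proc fastclus data=sample_std outseed=seeds1 maxclusters=20 maxiter=0 summary;
   var income sales cars;
run;

/* Step 2: drop candidate clusters with few members                   */
/* (the cutoff of 5 is an illustrative choice)                        */
data seeds2;
   set seeds1;
   if _freq_ > 5;
run;

/* Step 3: recluster starting from the remaining seeds. With STRICT=, */
/* an observation farther than that distance from every seed is not   */
/* assigned, and gets a negative CLUSTER value in the OUT= data set.  */
proc fastclus data=sample_std seed=seeds2 maxclusters=5 strict=3.0
              maxiter=50 out=clus;
   var income sales cars;
run;

/* Candidate outliers: observations left unassigned by STRICT= */
data outliers;
   set clus;
   where cluster < 0;
run;

/* Largest distances from the nearest seed first */
proc sort data=outliers;
   by descending distance;
run;
```

A STRICT= value only makes sense relative to the scale of the variables, which is another reason to standardize first. The final reassignment step from the documentation example is left out here, since for outlier detection you want the unassigned observations to stay flagged.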

 

http://documentation.sas.com/?cdcId=statcdc&cdcVersion=14.2&docsetId=statug&docsetTarget=statug_fast...

Contributor
Posts: 25

Re: k means clustering

@ballardw thank you for sharing the link. I have gone through the procedure described there. In one section, after deleting the clusters with very low frequency, they create only two clusters for detecting outliers, based on the previous seed analysis. Is it a rule of thumb to use only two clusters, or is there some other reason?

Actually, my data set contains ~70K obs with 6 vars. So, in this case, how many clusters should I consider?

Capture1.JPG

 

Regards

S Manoj
