02-12-2018 12:53 AM
I have been working with a data set of more than 60k observations, trying to find outliers with k-means clustering, but I am not able to interpret the results. Can someone guide me on how to interpret the output and how to detect outliers from the k-means output?
Thanks in advance,
02-12-2018 12:37 PM
From the documentation on Fastclus:
The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables.
From the names of your variables, I doubt that region, state, place or manufacturer are quantitative variables; they look categorical instead. That means they are likely to be more useful as ID variables in FASTCLUS.
If Cars is something like the number of cars sold, purchased or registered, then it is likely a VAR variable.
You should also consider standardizing the variables; otherwise the variable with the largest overall variation is likely to dominate the cluster assignment.
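To make the standardizing point concrete, here is a minimal sketch in Python (not SAS) of z-scoring each variable before clustering. The data and the `standardize` helper are made up for illustration; in SAS you would use a standardizing procedure rather than hand-written code.

```python
from statistics import mean, stdev

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the sample std dev."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [
        # A constant column (std dev 0) is mapped to 0.0 to avoid division by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical data: two variables on very different scales.
rows = [(10, 20000.0), (12, 35000.0), (11, 18000.0), (40, 22000.0)]
scaled = standardize(rows)
```

After this step both columns have mean 0 and standard deviation 1, so a Euclidean distance no longer favors the variable measured in larger units.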
The example in the documentation on outliers using PROC FASTCLUS shows a three-step process: first get some seeds (starting cluster centers), then remove candidate clusters with few members, then compute for each observation its distance from its cluster seed. A final step reassigns the observations to a final cluster (a step you likely don't want).
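The idea behind those steps can be sketched in a few lines of Python (not SAS, and not the documentation's code). The points, the seeds, `min_size` and `cutoff` are all assumptions chosen for illustration, and the sketch never recomputes the cluster centers, which a real k-means run would.

```python
import math
from collections import Counter

def nearest(point, seeds):
    """Return (index, distance) of the seed closest to point."""
    dists = [math.dist(point, s) for s in seeds]
    i = min(range(len(seeds)), key=dists.__getitem__)
    return i, dists[i]

def flag_outliers(points, seeds, min_size, cutoff):
    # Step 1: assign every point to its nearest starting seed and count members.
    counts = Counter(nearest(p, seeds)[0] for p in points)
    # Step 2: drop seeds whose clusters have fewer than min_size members.
    kept = [s for i, s in enumerate(seeds) if counts[i] >= min_size]
    # Step 3: flag points farther than cutoff from their nearest surviving seed.
    return [p for p in points if nearest(p, kept)[1] > cutoff]

# Hypothetical 2-D data: two tight groups plus one isolated point.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 5.0), (5.1, 4.9),
          (20.0, 20.0)]
seeds = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
outliers = flag_outliers(points, seeds, min_size=2, cutoff=3.0)
```

Here the seed at (20, 20) attracts only one point, so it is dropped, and that point then sits far from both remaining seeds and is flagged. The cutoff is a judgment call; on standardized data it is easier to pick one value that is meaningful across variables.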
02-14-2018 03:37 AM
@ballardw thank you for sharing the link. I have gone through the procedure described there. In one section, after deleting clusters with very low frequency, they create only two clusters for detecting outliers based on the previous seed analysis. I just want to know: is it a rule of thumb to take only two clusters, or is there some other reason?
My data set contains ~70K obs with 6 vars, so in this case how many clusters should I consider?