mjbstats
Hello,

I am trying to find an optimal clustering of a relatively small data set (n=900 observations) using a single dependent variable. Each observation corresponds to one level of a categorical variable, zip code.

My interpretation of the SAS documentation is as follows:

1) I want disjoint clusters, in that I want groupings of the zip codes based on their similarity with respect to the dependent variable, and I do not want any zip code assigned to more than one cluster.

2) A number of procedures might work, but the simplicity of my application seems to indicate that FASTCLUS or CLUSTER are good starting procs.

3) My criterion for choosing an "optimal" clustering is good differentiation between the clusters on the dependent variable, which means I want to minimize the within-cluster variance and maximize the between-cluster variance.

4) I find a lot of the advanced Clustering Analysis discussion confusing (e.g., the role of nonparametric probability density estimates in various methods).
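To make the criterion in point 3 concrete, here is a minimal sketch in Python (not SAS, and not output from any SAS procedure) of how the total sum of squares for one variable splits into a within-cluster part and a between-cluster part. The function name `variance_split` and the toy values are made up for illustration only; a tighter clustering shows up as a smaller within-cluster share.

```python
from collections import defaultdict

def variance_split(values, labels):
    """Return (within_ss, between_ss, total_ss) for a one-variable clustering.

    values: the dependent variable, one number per observation
    labels: the cluster assigned to each observation (any hashable id)
    """
    n = len(values)
    grand_mean = sum(values) / n

    # Collect the observations that belong to each cluster.
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)

    within = 0.0
    between = 0.0
    for members in groups.values():
        m = sum(members) / len(members)
        # Spread of the members around their own cluster mean.
        within += sum((v - m) ** 2 for v in members)
        # Distance of the cluster mean from the grand mean, weighted by size.
        between += len(members) * (m - grand_mean) ** 2

    total = sum((v - grand_mean) ** 2 for v in values)
    return within, between, total

# Toy data (hypothetical): two obvious groups on one variable.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good = [0, 0, 0, 1, 1, 1]   # respects the gap in the data
bad = [0, 1, 0, 1, 0, 1]    # mixes low and high values

good_split = variance_split(values, good)
bad_split = variance_split(values, bad)
```

Because within plus between always equals the total, minimizing one is the same as maximizing the other, so the two halves of the criterion are a single objective for a fixed number of clusters.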

My clusters tend to be poorly separated. Some observations are clearly apart from the others (and can be clustered as such), but the rest of the data are spread fairly uniformly across the range of values. Even for the uniformly distributed data, we'd like to break the observations into reasonable groups based on where they fall within that range. We're aiming for roughly 15 to 20 clusters.
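With a single variable, a partition that minimizes the within-cluster sum of squares always consists of contiguous intervals of the sorted values, so for a fixed number of groups the best cut points can be found exactly by dynamic programming. The following is a minimal Python sketch of that idea, offered as an illustration rather than as what FASTCLUS or CLUSTER do internally; the function name `optimal_breaks` and the toy values are hypothetical.

```python
def optimal_breaks(values, k):
    """Split one variable into k contiguous groups with minimal within-group SS.

    Returns (groups, within_ss), where groups is a list of k sorted lists.
    Runs in O(k * n^2) time, which is manageable for n around 900.
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of observations")

    # Prefix sums of x and x^2 give the SS of any slice in constant time.
    s1 = [0.0]
    s2 = [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i, j):
        """Sum of squared deviations of xs[i:j] around its own mean (j > i)."""
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[g][j]: best within-SS when xs[:j] is split into g groups.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    # cut[g][j]: start index of the last group in that best split.
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = cost[g - 1][i] + sse(i, j)
                if c < cost[g][j]:
                    cost[g][j] = c
                    cut[g][j] = i

    # Walk the stored cut points backwards to recover the groups.
    groups = []
    j = n
    for g in range(k, 0, -1):
        i = cut[g][j]
        groups.append(xs[i:j])
        j = i
    groups.reverse()
    return groups, cost[k][n]

# Toy data (hypothetical): two tight groups plus one isolated value.
groups, within_ss = optimal_breaks([12.0, 1.0, 30.0, 2.0, 10.0, 3.0, 11.0], 3)
```

On uniformly spread data this approach still returns k groups, but the cut points are then driven mostly by the requested k rather than by any natural gaps, so the choice between 15 and 20 groups remains a judgment call.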

Can anyone provide some guidance as to appropriate procedures and smoothing parameters for my application?

Many thanks.
