mjbstats
Calcite | Level 5
Hello,

I am attempting to identify an optimal clustering of a relatively small data set (n=900 observations) using only one dependent variable, with one value per zip code (the zip code itself is a categorical identifier).

My interpretation of the SAS documentation is as follows:

1) I want disjoint clusters, in that I want groupings of the zip codes based on their similarity with respect to the dependent variable, and I do not want any zip code assigned to more than one cluster.

2) A number of procedures might work, but the simplicity of my application seems to indicate that FASTCLUS or CLUSTER are good starting procs.

3) My criterion for choosing the "optimal" clustering is good differentiation between the clusters based on the dependent variable, which means I want to minimize the within-cluster variance and maximize the between-cluster variance.

4) I find a lot of the advanced Clustering Analysis discussion confusing (e.g., the role of nonparametric probability density estimates in various methods).
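The criterion in point 3 can be made concrete with a small sketch. It is written in Python rather than SAS purely for illustration, and the function names (`variance_split`, `r_squared`) and the sample values are hypothetical. It splits the total sum of squares of a single variable into a within-cluster part and a between-cluster part, then reports the between/total ratio, so a partition with tighter and better separated clusters scores closer to 1.

```python
from statistics import fmean


def variance_split(values, labels):
    """Return (within_ss, between_ss) for one numeric variable and disjoint labels."""
    grand_mean = fmean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    within = 0.0
    between = 0.0
    for members in groups.values():
        center = fmean(members)
        # Spread of the members around their own cluster mean.
        within += sum((v - center) ** 2 for v in members)
        # Distance of the cluster mean from the grand mean, weighted by size.
        between += len(members) * (center - grand_mean) ** 2
    return within, between


def r_squared(values, labels):
    """Share of the total sum of squares explained by the cluster assignment."""
    within, between = variance_split(values, labels)
    total = within + between
    return between / total if total > 0 else 0.0


# Made-up values standing in for a per-zip-code measure.
density = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.9, 10.2]
good = [0, 0, 0, 1, 1, 1, 2, 2]   # groups follow the gaps in the data
poor = [0, 1, 2, 0, 1, 2, 0, 1]   # groups ignore the values entirely

print(r_squared(density, good) > r_squared(density, poor))
```

Because every observation carries exactly one label, the partition is disjoint in the sense of point 1. The ratio always rises as the number of clusters grows, so it is more useful for comparing partitions with the same number of clusters than for picking that number.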

My clusters tend to be poorly separated. Some observations are clearly apart from others (and can be clustered as such), but the rest of the data is somewhat uniformly distributed across the range of values. Still, even for the uniformly distributed data, we'd like to break the observations into reasonable groups based on where they fall within the range of values. We're shooting for 15-20 or so clusters.

Can anyone provide some guidance as to appropriate procedures and smoothing parameters for my application?

Many thanks.
3 REPLIES
Paige
Quartz | Level 8
I don't think you can run a cluster analysis on zip code. It simply doesn't make sense to use a categorical variable in cluster analysis.

Can you state in words what results you are looking for from this zip code information, instead of stating that "Cluster Analysis" is what you are looking for?

Message was edited by: Paige
mjbstats
Calcite | Level 5
I am not clustering on zip code, but rather on a measure (let's say it's population density, for the sake of argument) that exists per zip code.

The clusters are to be based on similar values of the dependent variable.
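As a rough illustration of grouping zip codes by similar values of one measure, here is a minimal k-means style sketch on a single variable. It is Python rather than SAS, and it is not what any SAS procedure does internally: the function name `kmeans_1d`, the evenly spaced starting seeds, and the sample numbers are all assumptions made for the example.

```python
from statistics import fmean


def kmeans_1d(values, k, max_iter=100):
    """Lloyd-style k-means on one numeric variable; returns (centers, labels)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of observations")

    ordered = sorted(values)
    n = len(ordered)
    # Assumed seeding rule: take seeds spread evenly through the sorted data.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    def assign(current):
        # Each observation goes to its nearest center, so clusters are disjoint.
        return [min(range(k), key=lambda j: abs(v - current[j])) for v in values]

    for _ in range(max_iter):
        labels = assign(centers)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            updated.append(fmean(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated

    return centers, assign(centers)


# Made-up per-zip-code values; the zip codes are identifiers only.
zips = ["10001", "10002", "10003", "20001", "20002", "20003", "30001", "30002"]
density = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.9, 10.2]

centers, labels = kmeans_1d(density, k=3)
for zip_code, label in zip(zips, labels):
    print(zip_code, label)
```

The zip code never enters the distance calculation; it is only carried along to label the result. On a stretch of roughly uniform values, this kind of procedure tends to cut the range into adjacent intervals, which matches the "reasonable groups within the range" goal described in the original question.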
goladin
Calcite | Level 5
Hi,

I think you should run PROC CLUSTER, leaving the zip code out, and obtain the CCC (cubic clustering criterion) to help choose the number of clusters. Once you have it, you can then run PROC FASTCLUS.

Take a look.

http://support.sas.com/forums/thread.jspa?messageID=45504

Regards,
Murphy
