Re: Cluster Analysis Newbie

mjbstats · Posted 11-08-2010 11:19 AM

Hello,

I am attempting to identify optimal clustering of a relatively small data set (n=900 observations) using only one dependent variable (which corresponds to a categorical variable, zip code).

My interpretation of the SAS documentation is as follows:

1) I want disjoint clusters, in that I want groupings of the zip codes based on their similarity with respect to the dependent variable, and I do not want any zip code assigned to more than one cluster.

2) A number of procedures might work, but the simplicity of my application seems to indicate that FASTCLUS or CLUSTER are good starting procs.

3) My criterion for choosing "optimal" clustering is good differentiation between the clusters based on the dependent variable, which means I want to minimize the within-cluster variance and maximize the between-cluster variance.

4) I find a lot of the advanced Clustering Analysis discussion confusing (e.g., the role of nonparametric probability density estimates in various methods).
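
To make the criterion in point 3 concrete, here is a minimal sketch in Python (not SAS) of how the total variation in a single variable splits into a within-cluster and a between-cluster part. The `density` values and both sets of labels are made up for illustration; they are not from the actual data set. The ratio between / (within + between) is the usual R-squared of a clustering: the closer it is to 1, the better the clusters are separated on that variable.

```python
from statistics import mean

def variance_split(values, labels):
    """Return (within_ss, between_ss) for a clustering of one variable."""
    grand = mean(values)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    within = 0.0
    between = 0.0
    for members in groups.values():
        centre = mean(members)
        # Spread of the members around their own cluster mean.
        within += sum((v - centre) ** 2 for v in members)
        # Spread of the cluster mean around the grand mean, weighted by size.
        between += len(members) * (centre - grand) ** 2
    return within, between

def r_squared(values, labels):
    """Share of total variation explained by the cluster assignment."""
    within, between = variance_split(values, labels)
    total = within + between
    return between / total if total else 0.0

# Hypothetical per-zip-code measure (e.g. population density), illustration only.
density = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good = [0, 0, 0, 1, 1, 1]   # groups neighbouring values together
bad = [0, 1, 0, 1, 0, 1]    # mixes low and high values in each cluster

print(r_squared(density, good))
print(r_squared(density, bad))
```

The first assignment leaves almost all of the variation between clusters, while the second leaves most of it inside them, which is the contrast the criterion is meant to capture.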

My clusters tend to be poorly separated. Some observations are clearly apart from others (and can be clustered as such), but the rest of the data is somewhat uniformly distributed across the range of values. Still, even for the uniformly distributed data, we'd like to break the observations into reasonable groups based on where they fall within the range of values. We're shooting for 15-20 or so clusters.

Can anyone provide some guidance as to appropriate procedures and smoothing parameters for my application?

Many thanks.

Paige · Posted 11-08-2010 12:31 PM

I don't think you can run a cluster analysis on zip code. It simply doesn't make sense to use a categorical variable in cluster analysis.

Can you state in words what results you are looking for from this zip code information, instead of stating "Cluster Analysis" is what you are looking for?

mjbstats · Posted 11-08-2010 03:49 PM

I am not clustering on zip code, but rather on a measure (let's say it's population density for the sake of argument) that exists per zip code.

The clusters are to be based on similar values of the dependent variable.

goladin · Posted 11-09-2010 06:50 AM

Hi,

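I think you should run PROC CLUSTER on the measure alone, leaving the zip code out, and obtain the cubic clustering criterion (CCC) to judge how many clusters the data support. Once you have that number, you can then run PROC FASTCLUS with it.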

Take a look.

http://support.sas.com/forums/thread.jspa?messageID=45504
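
The two-step idea (first pick a number of clusters, then run a k-means style procedure with that number) can be sketched outside SAS as follows. This is only an illustration in Python, under several assumptions: it uses a plain one-variable k-means (Lloyd's algorithm) rather than PROC FASTCLUS, it reports R-squared for each candidate number of clusters as a stand-in for the CCC, which it does not compute, and the `density` values are invented. The function names are hypothetical as well.

```python
def assign(values, centres):
    """Label each value with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda j: (x - centres[j]) ** 2)
            for x in values]

def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on one variable, seeded at evenly spaced order statistics."""
    xs = sorted(values)
    n = len(xs)
    centres = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        labels = assign(values, centres)
        updated = []
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            # Keep the old centre if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centres[j])
        if updated == centres:
            break
        centres = updated
    return assign(values, centres), centres

def r_squared(values, labels):
    """1 - within-cluster SS / total SS for a given assignment."""
    grand = sum(values) / len(values)
    total = sum((x - grand) ** 2 for x in values)
    groups = {}
    for x, lab in zip(values, labels):
        groups.setdefault(lab, []).append(x)
    within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                 for g in groups.values())
    return 1.0 - within / total if total else 0.0

# Hypothetical per-zip-code measure, illustration only.
density = [1, 2, 3, 10, 11, 12, 20, 21, 22]

# Sweep candidate cluster counts and look for where the gain levels off.
for k in range(1, 5):
    labels, centres = kmeans_1d(density, k)
    print(k, round(r_squared(density, labels), 4))
```

R-squared can only go up as clusters are added, so the useful signal is the point where the improvement flattens out, not the maximum. With data that is close to uniform over much of its range, that flattening may be gradual, and the choice of 15 to 20 groups would then rest more on practical grounds than on the statistic.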

Regards,
Murphy
