Cluster Analysis Newbie

11-08-2010 11:19 AM

Hello,

I am attempting to identify optimal clustering of a relatively small data set (n=900 observations) using only one dependent variable (which corresponds to a categorical variable, zip code).

My interpretation of the SAS documentation is as follows:

1) I want disjoint clusters, in that I want groupings of the zip codes based on their similarity with respect to the dependent variable, and I do not want any zip code assigned to more than one cluster.

2) A number of procedures might work, but the simplicity of my application seems to indicate that FASTCLUS or CLUSTER are good starting procs.

3) My criterion for choosing the "optimal" clustering is good differentiation between the clusters on the dependent variable, which means I want to minimize the within-cluster variance and maximize the between-cluster variance.

4) I find a lot of the advanced cluster analysis discussion confusing (e.g., the role of nonparametric probability density estimates in various methods).
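
The criterion in point 3 can be made concrete. For a single variable x split into K disjoint clusters C_1, ..., C_K (notation introduced here for illustration, not taken from the SAS documentation), the total sum of squares splits into a within-cluster part and a between-cluster part:

$$
\underbrace{\sum_{i=1}^{n} (x_i - \bar{x})^2}_{SS_T}
= \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} (x_i - \bar{x}_k)^2}_{SS_W}
+ \underbrace{\sum_{k=1}^{K} n_k \, (\bar{x}_k - \bar{x})^2}_{SS_B},
\qquad
R^2 = 1 - \frac{SS_W}{SS_T}
$$

Here n_k is the number of observations in cluster C_k, \bar{x}_k is that cluster's mean, and \bar{x} is the overall mean. Because SS_T is fixed by the data, minimizing the within-cluster part and maximizing the between-cluster part are the same goal for a given K. R^2 can only increase as K grows, so it cannot choose the number of clusters on its own.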

My clusters tend to be poorly separated. Some observations are clearly apart from others (and can be clustered as such), but the rest of the data is somewhat uniformly distributed across the range of values. Still, even for the uniformly distributed data, we'd like to break the observations into reasonable groups based on where they fall within the range of values. We're shooting for 15-20 or so clusters.

Can anyone provide some guidance as to appropriate procedures and smoothing parameters for my application?

Many thanks.

Posted in reply to mjbstats

11-08-2010 12:31 PM

I don't think you can run a cluster analysis on zip code. It simply doesn't make sense to use a categorical variable in a cluster analysis.

Can you state in words what results you are looking for from this zip code information, instead of stating "Cluster Analysis" is what you are looking for?

Message was edited by: Paige

Posted in reply to Paige

11-08-2010 03:49 PM

I am not clustering on zip code, but rather on a measure (let's say it's population density, for the sake of argument) that exists per zip code.

The clusters are to be based on similar values of the dependent variable.

Posted in reply to mjbstats

11-09-2010 06:50 AM

Hi,

I think you should run PROC CLUSTER, leaving the zip code out of the analysis variables, and obtain the CCC (cubic clustering criterion). Once you have it, you can then run PROC FASTCLUS with the number of clusters it suggests.

Take a look.

http://support.sas.com/forums/thread.jspa?messageID=45504
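
A minimal sketch of that two-step approach might look like the following. The data set name WORK.ZIPDATA and the variable names ZIP and DENSITY are assumptions made for illustration, and MAXCLUSTERS=15 is only a placeholder for whatever number the CCC output suggests.

```sas
/* Assumed input: WORK.ZIPDATA with one row per zip code,          */
/* an identifier ZIP and a numeric measure DENSITY (both names     */
/* are hypothetical).                                              */

/* Step 1: hierarchical clustering on the measure only.            */
/* ZIP is used as an ID, not as an analysis variable.              */
/* CCC and PSEUDO add the cubic clustering criterion and the       */
/* pseudo F / pseudo t-squared statistics to the cluster history.  */
proc cluster data=work.zipdata method=ward ccc pseudo
             print=25 outtree=work.tree;
   var density;
   id zip;
run;

/* Step 2: k-means style clustering with the chosen number of      */
/* clusters. 15 is a placeholder, not a recommendation.            */
proc fastclus data=work.zipdata maxclusters=15 maxiter=100
              out=work.clustered;
   var density;
   id zip;
run;

/* Step 3: inspect the range of the measure inside each cluster.   */
/* The OUT= data set carries the assigned cluster in CLUSTER.      */
proc means data=work.clustered n min max mean;
   class cluster;
   var density;
run;
```

In the step 1 cluster history, local peaks in the CCC and pseudo F columns are commonly read as candidate numbers of clusters. With poorly separated, roughly uniform data there may be no clear peak, in which case the choice within the 15-20 range is largely a judgment call.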

Regards,

Murphy
