
proc cluster for mixed data

datalligence — Tue, 24 Jun 2008 12:32:32 GMT

I have a data set of about 600,000 obs. The variables I would like to use for grouping observations/transactions include numeric and categorical variables.

In PROC CLUSTER, which METHOD or distance measure would be the most appropriate?

Re: proc cluster for mixed data

Olivier — Tue, 24 Jun 2008 18:06:32 GMT

Hi.
1) PROC CLUSTER will take a very long time on that many observations. Consider using FASTCLUS instead, or at least use it to create first-level clusters that CLUSTER then processes (I think the SAS help calls this the two-stage method).
2) Use the PRINQUAL or CORRESP procedure to pre-process your data: these can create numeric (continuous) variables that summarize the information in the categorical variables. Then merge those with the existing numeric variables and cluster.
Regards.
Olivier
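
A rough sketch of point 1, for illustration only. The data set name work.trans and the variables x1-x5 are placeholders (they stand for the numeric variables, including any scores created in point 2), and MAXCLUSTERS=50 is an arbitrary choice, not a recommendation.

```sas
/* Stage 1: collapse the observations into preliminary clusters.     */
/* work.trans and x1-x5 are hypothetical names; 50 is arbitrary.     */
proc fastclus data=work.trans maxclusters=50 maxiter=20
              mean=work.prelim out=work.stage1;
   var x1-x5;
run;

/* Stage 2: hierarchical clustering of the preliminary cluster means. */
/* The MEAN= data set carries _FREQ_ and _RMSSTD_, so each mean is    */
/* weighted by the number of observations it represents.              */
proc cluster data=work.prelim method=ward outtree=work.tree;
   var x1-x5;
   freq _freq_;
   rmsstd _rmsstd_;
run;
```

The OUTTREE= data set can then be passed to PROC TREE to cut the hierarchy at a chosen number of clusters.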

Re: proc cluster for mixed data

datalligence — Wed, 25 Jun 2008 06:11:53 GMT

FASTCLUS has a lot of limitations, and is not suitable for mixed data.

I guess I will have to use PROC DISTANCE with Gower's dissimilarity. But when I run PROC CLUSTER, which distance method will be the most appropriate?

Thanks,
Romakanta
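
For reference, a minimal sketch of the PROC DISTANCE route, under several assumptions. All data set and variable names (work.trans, amount, qty, channel, region, trans_id) are made up, and trans_id is assumed to be a character key. A full distance matrix for 600,000 observations is not practical, so the sketch works on a random sample; the sample size of 5000 is arbitrary. METHOD=AVERAGE is used in PROC CLUSTER because the linkage methods (average, single, complete) accept arbitrary dissimilarities, whereas Ward's, centroid and median are defined in terms of Euclidean distances.

```sas
/* Draw a simple random sample; n=5000 is an arbitrary illustration. */
proc surveyselect data=work.trans out=work.samp
                  method=srs n=5000 seed=12345;
run;

/* Gower dissimilarity over mixed variable types.                    */
/* amount, qty, channel, region, trans_id are hypothetical names.    */
proc distance data=work.samp method=dgower out=work.dist;
   var interval(amount qty) nominal(channel region);
   id trans_id;
run;

/* Hierarchical clustering directly on the dissimilarity matrix.     */
proc cluster data=work.dist(type=distance) method=average
             outtree=work.tree;
   id trans_id;
run;
```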