datalligence
Fluorite | Level 6
I have a data set of about 600,000 observations. The variables I would like to use for grouping the observations (transactions) include both numeric and categorical variables.

In PROC CLUSTER, which METHOD or distance measure would be the most appropriate?
Olivier
Pyrite | Level 9
Hi.
1) PROC CLUSTER will take a very long time on this many observations. Consider using FASTCLUS to do the job, or at least to create first-level clusters that are then processed by CLUSTER (I think this is called the two-stage method in the SAS help).
2) Use the PRINQUAL or CORRESP procedure to pre-process your data: these can create numeric (continuous) variables that summarize the information in the categorical variables. Then merge these with the existing numeric variables, and then cluster.
Regards.
Olivier
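The preclustering idea in point 1 could look roughly like the sketch below. It assumes the categorical information has already been converted to numeric scores as in point 2 and that all inputs are on comparable scales. The data set names (work.trans_num, work.preclus, work.tree, work.final), the variables x1-x5, and the cluster counts (100 preclusters, 5 final clusters) are hypothetical placeholders, not values from this thread.

```sas
/* Stage 1: k-means preclustering with FASTCLUS.
   work.trans_num and x1-x5 are hypothetical names. */
proc fastclus data=work.trans_num maxclusters=100 maxiter=20
              mean=work.preclus out=work.assigned noprint;
   var x1-x5;
run;

/* Stage 2: hierarchical clustering of the precluster means.
   The MEAN= data set carries _FREQ_ and _RMSSTD_, which PROC CLUSTER
   uses by default to weight each precluster. */
proc cluster data=work.preclus method=ward outtree=work.tree noprint;
   var x1-x5;
run;

/* Cut the tree at a chosen number of final clusters (5 is arbitrary). */
proc tree data=work.tree nclusters=5 out=work.final noprint;
run;
```

Only the 100 precluster means go through the hierarchical step, which keeps the CLUSTER run small regardless of the size of the original data set.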
datalligence
Fluorite | Level 6
FASTCLUS has a lot of limitations, and is not suitable for mixed data.

I guess I will have to use PROC DISTANCE with Gower's dissimilarity. But when I run PROC CLUSTER, which distance method will be the most appropriate?

Thanks,
Romakanta
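A minimal sketch of the PROC DISTANCE plus PROC CLUSTER route is shown below. A full pairwise matrix for 600,000 observations would hold about 1.8 × 10^11 distances, so the sketch assumes the clustering is run on a random sample. The data set names, the sample size, and the variables (amount, quantity, channel, region, trans_id) are hypothetical, and trans_id is assumed to be a character variable with unique values. Average linkage is used here only as one common choice for a dissimilarity that is not Euclidean; it is not a recommendation from this thread.

```sas
/* Draw a simple random sample; work.trans and n=5000 are hypothetical. */
proc surveyselect data=work.trans out=work.sample
                  method=srs n=5000 seed=20240101;
run;

/* Gower dissimilarity (1 - Gower similarity) over mixed variable types. */
proc distance data=work.sample method=dgower out=work.gower;
   var interval(amount quantity) nominal(channel region);
   id trans_id;   /* assumed character and unique */
run;

/* Average linkage on the TYPE=DISTANCE data set produced above. */
proc cluster data=work.gower method=average outtree=work.tree noprint;
   id trans_id;
run;
```

Methods such as WARD and CENTROID interpret the input distances as Euclidean, which is worth keeping in mind when choosing METHOD= for a Gower matrix.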